Multiword Expressions: Insights from a
multi-lingual perspective 


Manfred Sailer 
Goethe University Frankfurt/Main 


Stella Markantonatou 
Institute for Language and Speech Processing, Athena RIC, Greece 


In this introductory chapter, we present the basic concept of the volume at hand. 
The central aspects of the individually contributed chapters are sketched and some 
of the relations among the chapters are pointed out. 


1 Introduction 


Multiword expressions (MWEs) are not only a challenge for natural language ap- 
plications, they also present a challenge to linguistic theory. This is so because, 
for the vast majority of them, their structure can be predicted by the grammar 
rules of the language to which they belong while the semantics of a substantial 
subset of MWEs is unpredictable or fixed. Therefore, MWEs often defy the
application of the machinery developed for free combinations where the default is
that the meaning of an utterance can be predicted from its structure. 

There is a rich body of primarily descriptive work on MWEs for many European
languages, but there is little comparative work in this area that spans
descriptive, theoretical, and computational issues. This volume brings together
MWE experts with individual languages as their background to explore the
benefits of a multi-lingual perspective on MWEs with regard to all dimensions of
linguistic research: descriptive coverage, theoretical scrutiny, and computational
exploitation.


Manfred Sailer & Stella Markantonatou. 2018. Multiword Expressions: In- 

We assume a broad concept of MWE in this volume, using MWE as the cover
term for any kind of phraseological unit. As such, it comprises idioms, colloca- 
tions, complex names, phraseological patterns, etc. We chose the term MWE as 
the default in this volume, but use its competitors interchangeably with it where 
no confusion arises. Each contribution will specify explicitly within which em- 
pirical sub-domain of phraseology it is located. 

We hope that this introductory chapter will make the book more accessible to a
wider audience and will place it within the current state of research in
phraseology and on multiword expressions. Two general issues about this book
are addressed here: the variety of linguistic formalisms used and the general
research issues discussed.

The book contains contributions from various linguistic frameworks. Since the
individual contributions are relatively short, we consider it useful to provide a
brief overview of the frameworks.

We will identify some general research questions that we see either promi- 
nently emerging in the field or as topics that should be addressed in the future 
and will show how the contributions in this volume address some of these issues. 
The multi-lingual perspective will serve as a guiding principle in the choice of 
topics. Of course, our perspective may well be biased due to personal preferences 
and limitations. 

Wherever it seems useful, we will point out links between the papers in this 
volume and show in which respect they point in the same direction or seem to 
reach mutually incompatible conclusions - clear evidence of the lively ongoing
discussions in the MWE field!

It is a privilege for us that this book appears as one of the first volumes in the 
new Language Science Press series Phraseology and Multiword Expressions. We 
hope that it will pave the way for future books in this series that will take up 
some of the questions that are addressed here. 


2 Topics in multi-lingual MWE research 


In this section, we will briefly address three aspects that play an important role 
in the contributions to this volume: MWE classification, methods and issues in 
multi-lingual MWE research, and aspects of individual MWE types. In each of 
the following subsections, we will introduce the basic question and sketch how 
contributions in this volume address it. 


2.1 Classifications of MWEs


The classification of MWEs is a challenge, all the more so because there is no
general consensus about what counts as an MWE. Burger (2015) characterises
phraseological units by three properties: polylexicality, fixedness, and idiomaticity, where
idiomaticity need not be present in all phraseological units. Fleischer (1997) views 
phraseology as a fuzzy concept with polylexicality as the only obligatorily pres- 
ent criterion. He assumes three further prototypical properties that define the 
fuzzy concept. As the term prototypical suggests, these properties can be present 
or absent to various degrees. These properties are fixedness, idiomaticity, and lex- 
icalisation. Idioms of the type kick the bucket ‘die’ are the core cases of phrasemes, 
satisfying all three criteria. Collocations (open the door) may lack idiomaticity, 
phraseological patterns (as goes X so goes Y) may not be fully lexicalised. 

The concept that an expression can be a gradually more or less typical repre- 
sentative of an MWE has been generalised to the extreme in most versions of 
Construction Grammar (Fillmore et al. 1988). This framework abandons the split 
between Lexicon and Grammar and replaces them with a Constructicon that con- 
sists of more or less general and complex constructions. In this view, traditional 
lexical entries are specific but simple constructions, and classical rules of gram- 
mar are general but complex constructions. MWEs, such as idioms, are found in 
a middle position of this continuum, being rather specific and, at the same time, 
quite complex constructions. Consequently, it is impossible to define MWEs in 
this framework - which has, of course, been a conscious design decision in Con- 
struction Grammar. Baldwin & Kim (2010) come from a different angle. For them, 
MWE-hood is in the eye of the beholder: we need to define what we assume to 
be the "rule" (at any level of linguistic description or language use), and anything 
that deviates from the rule in one way or another will be classified as an MWE. In 
this view, the degree of irregularity or idiosyncrasy of an MWE can be observed, 
but it will be a yes/no split as to what counts as an MWE and what does not. 

So far, we have discussed three attempts to define the boundaries of the do- 
main of MWE research. All of them have proven fruitful in research, and we do 
not see a point in choosing one over the other in abstracto. We can, however, un- 
derstand the differences if we look at the underlying purpose of the definitions. 
Fleischer (1997) is in the tradition of Soviet phraseological research. There,
phraseology is considered the third pillar of linguistics, complementing the
Lexicon and Grammar by looking at objects that have both lexical and phrasal
properties. Fillmore et al. (1988) developed their theory in opposition to the
very abstract universalist ideas in the Chomskyan paradigm. Finally, Baldwin &
Kim (2010) have concrete computational applications in mind, such as the
extraction of MWEs. If there were no difference between MWEs and free
combinations, it would be impossible (or meaningless) to build a database of MWEs.
The insight that emerges from these considerations is that we need to clarify in 
which context and for which purpose a characterisation and, as we will see in 
a second, a classification of MWEs has been proposed. Rather than adopting or 
rejecting a proposal in general, we should examine critically how far a proposal 
is suitable relative to our own current framework and research question. 

To be on the most inclusive side, let us assume that the domain of MWE re- 
search consists of any expression that contains more than one basic lexical el- 
ement and that is lexicalised, fixed, idiomatic, or irregular in one way or the other. 
This results in a highly heterogeneous set of expressions. Consequently, we need 
to structure this huge empirical domain by imposing a classification on it. Just as 
before, however, there is no hope of finding a single classification or taxonomy 
of MWEs that can be used for all purposes. Nonetheless, some proposed classifi- 
cations are better than others. This evaluation will need to take into account the 
purpose of the classification. Parsing, MWE extraction, cognitive representation, 
second language learning, machine translation, and many other purposes can be 
thought of. In all these domains, MWEs pose highly intriguing challenges, but it 
is unlikely that the same classification will be useful for all of them. 

For illustration, we can look at a number of classificatory criteria that have 
been proposed in the literature and show that they are essential for some, but, 
probably, relatively useless for other purposes. Makai (1972) distinguishes be- 
tween idioms of decoding and idioms of encoding. The first class of idioms con- 
tains expressions that can only be understood if they are known to the hearer. 
This is the case for expressions such as kick the bucket 'die', but less so for ex- 
pressions like answer the phone or brush one's teeth. Idioms of encoding are ex- 
pressions that need to be known in order to produce them. All three examples 
given would count as idioms of encoding, since it is an arbitrary convention that 
the idea of doing dental hygiene is expressed as brush one's teeth in English 
rather than as clean one's teeth. In German, it is the other way around, with 
Zähne putzen 'teeth clean' rather than Zähne bürsten 'teeth brush' being
conventionalised, even though the instrument to brush your teeth with is called a
Zahnbürste ‘toothbrush’ in German, just as it is in English. 

The distinction between a decoding and an encoding perspective is clearly 
useful for parsing versus generation, but also for designing MWE collections for 
foreign language learners, who need both types of MWEs, in contrast to MWE 
collections for native speakers, which usually contain only idioms of decoding. 
For the purpose of a computational system for automatic MWE extraction,
however, this distinction is completely immaterial; indeed, it would be misleading
to evaluate an MWE extraction system with respect to its success in categorizing
MWEs correctly as decoding or encoding MWEs.

Syntactic flexibility is a classificatory criterion that has been widely relied 
upon for retrieving, cataloguing, and parsing MWEs. Whether or not an MWE 
can appear in a number of different constructions, or, from a different point of 
view, can undergo some transformations, has been a central concern of treat- 
ments of idioms in Generative Grammar (see Fraser 1970, for example). One of 
the most cited works in the computationally oriented MWE literature, namely
“Multiword Expressions: A pain in the neck for NLP” (Sag et al. 2002), is about
the classification of MWEs in terms of syntactic flexibility. This criterion also 
plays a central role in the contributions to this volume by Kuiper, Laporte, Parra 
Escartín et al., Bargmann & Sailer, and Markantonatou & Samaridi - although
the last two contributions are interested in the ability of MWEs to appear in
different constructions and in its theoretical ramifications rather than in
classification per se. Syntactic flexibility remains a concern in classifications
that are computationally oriented and rely on more criteria, for instance,
classifications that draw on the syntactic function of MWEs: Parra Escartín et
al. (2018 [this volume]) classify MWEs in terms of both syntactic flexibility and
syntactic function (namely, whether an MWE functions as noun, verb or
adjective/adverb).

Typically at least two degrees of flexibility are distinguished, telling apart kick 
the bucket-type expressions, which cannot undergo passivisation, from spill the 
beans-type MWEs, which can. This is a core distinction for formal theories of 
idioms such as the one in Generalized Phrase Structure Grammar (Gazdar et al. 
1985) or in Nunberg et al. (1994). After all, passivisation has the status of a
major diagnostic in linguistic theory. For instance, in the early 1980s, the
then newly developed Lexical Functional Grammar (LFG) relied on passivisation in
order to advocate lexicalism and to define grammatical functions that are
important axioms of the theory. Passivisation is discussed by several
contributors to this volume, and opinions vary widely. Markantonatou & Samaridi
(2018 [this volume]), who work within the LFG framework, draw on passivisation,
as it seems to be able to split Greek MWE data nearly into two. Bargmann &
Sailer (2018 [this volume]), on the other hand, argue that, in the right
context, most or all English MWEs can passivise. Other languages, such as
German, impose even fewer or no restrictions on MWE passivisation. It is on
these grounds that, according to them, passivisation is neutralised as a
universal classificatory diagnostic for MWEs, but it may be valid in individual
languages.

Laporte (2018 [this volume]) argues explicitly that the flexibility criterion of
classification is highly problematic because it actually points to an ensemble of
syntactic behaviours and, to date, there has been no reliable research on
exactly how this collective behaviour of diagnostics defines flexibility as a
measurable property. It must be said, though, that Laporte does not so much
claim that it is impossible to classify MWEs in terms of syntactic flexibility;
rather, his argument is that classifying MWEs in terms of a multi-dimensional
feature such as syntactic flexibility requires a substantial amount of data
about different MWEs and the application of classification methods.
These methods will apply over sets of features that receive binary values (+/-),
that is, over categorical variables.
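
To make the idea of classification over binary features concrete, here is a minimal sketch in Python. It is not Laporte's method: the feature names and the +/- values are invented for illustration only (the only judgements taken from the discussion above are that kick the bucket resists passivisation and spill the beans allows it), and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical diagnostics; True stands for "+", False for "-".
FEATURES = ("passive", "relativisation", "modification")

# Illustrative judgements only, not data from the volume.
PROFILES = {
    "kick the bucket": (False, False, False),
    "spill the beans": (True, True, True),
    "pull strings": (True, True, True),
    "bite the dust": (False, False, False),
}


def group_by_profile(profiles):
    """Group expressions that share exactly the same binary feature profile."""
    groups = defaultdict(list)
    for mwe, profile in profiles.items():
        groups[profile].append(mwe)
    return dict(groups)


def hamming(a, b):
    """Count the diagnostics on which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


classes = group_by_profile(PROFILES)
for profile, members in classes.items():
    signs = "".join("+" if value else "-" for value in profile)
    print(signs, members)
```

With many diagnostics and many expressions, the profiles would rarely coincide exactly, which is why a distance such as `hamming` and a proper clustering method over a large data set would be needed before "flexibility" could be treated as a measurable property.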

The approaches to syntactic flexibility we have discussed so far are categorical
in nature. They ask whether an MWE can participate in a phenomenon or not,
but they are not interested in the actual usage of the phenomenon. Of course,
syntactic flexibility can also be seen from the point of view of usage: is an
MWE that is frequently used with a structural "twist", even if it is the same
"twist" most of the time, syntactically flexible or not? Hanks et al. (2018
[this volume]) argue for a quantitative definition of syntactic flexibility that
takes into account the frequency of structural variations ("twists") of an MWE
in a corpus, and the first results they report suggest that there is little
agreement between the theoretically inspired and the frequency-inspired notions
of syntactic flexibility.
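
The contrast between a categorical and a usage-based view can be illustrated with a small Python sketch. The two scores below are generic illustrations chosen for this example, not the measures used by Hanks et al.; the corpus hits are invented, and the function names are hypothetical.

```python
import math
from collections import Counter


def variation_rate(observations, canonical):
    """Share of corpus hits that are NOT in the canonical form."""
    if not observations:
        raise ValueError("no corpus hits")
    counts = Counter(observations)
    return 1 - counts[canonical] / len(observations)


def variation_entropy(observations):
    """Shannon entropy (in bits) of the distribution over observed forms."""
    if not observations:
        raise ValueError("no corpus hits")
    total = len(observations)
    counts = Counter(observations)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Invented hits for one MWE: six canonical active uses, two passives.
hits = ["active"] * 6 + ["passive"] * 2

print(variation_rate(hits, "active"))
print(variation_entropy(hits))
```

Under the categorical view, this expression simply "passivises". Under the usage-based view, it varies in a quarter of its hits, and always with the same "twist". The rate and the entropy capture these two aspects separately, which is one way to see why the two notions of flexibility need not agree.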


2.2 Multi-lingual studies of MWEs 


Every multi-lingual or cross-lingual study of MWEs is confronted with a number
of questions. First, in order to be able to compare a phenomenon across
languages, some cross-linguistically constant, i.e. language-independent, aspect
has to be fixed. In this volume, this is achieved in different ways. In most papers,
semantic aspects of the considered class of MWEs are kept constant in the com- 
parison, usually together with some basic syntactic assumptions (such as looking 
at verbal MWEs). 

Bargmann & Sailer (2018 [this volume]) concentrate on one particular type 
of MWEs, the so-called non-decomposable idioms. They identify this domain by 
semantic criteria that are independent of a particular language. Subsequently, 
they look at the way in which the languages they consider differ in the syntactic 
flexibility of these MWEs. 

Fotopoulou & Giouli (2018 [this volume]) define their domain of study by se- 
mantic and syntactic criteria. They look at verbal MWEs that express emotions. 
They use a semantic classification of emotion expressions with respect to the
type of emotion and its intensity. On the formal side, they use a syntactic rep- 
resentation of MWEs that abstracts over some properties that are particular to 
individual languages. This allows them to identify comparable MWE classes in 
Modern Greek and French. 

Hanks et al. (2018 [this volume]) discuss a particular method to extract MWEs 
from a corpus and to classify them automatically according to their syntactic 
flexibility. They present a case study of the English word bite and its primary 
French translation mordre by looking at an identical number of hits from
standard general corpora of the two languages. They apply Corpus Pattern Analysis
(CPA) to this data set to identify the usage patterns of these two verbs, which
include a number of MWEs. Using statistical collocation measures on the extracted
patterns, they manage to determine the syntactic flexibility of each of these pat- 
terns. They show that their method can be applied to different languages and 
demonstrate that the extracted patterns of English and French can be used to 
study the cross-language correspondences as regards the patterns' literal and 
idiomatic meanings. 

Osenova & Simov (2018 [this volume]) study MWEs in parallel corpora of Bul- 
garian and English. They discover that MWE translational equivalents, at least 
for the particular language pair, tend to be either MWEs themselves or just sin- 
gle words; interestingly, translating an MWE with a compositional phrase is a 
rare phenomenon in their data. In order to encode these correspondences in a 
way that can be useful to parsing, they employ catenae (O'Grady 1998), which 
are argued to offer adequate expressivity for representing the structural and the 
semantic properties of MWEs. 
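
As background, a catena is commonly understood as a word or a combination of words that is connected through dependency links. The following Python sketch checks this property on a toy dependency tree. It is only an illustration under stated assumptions: the head assignments (in particular, treating the noun as the head of its determiner) are ours, `is_catena` is a hypothetical helper, and nothing here reproduces the encoding used by Osenova & Simov.

```python
def is_catena(heads, words):
    """Return True if `words` form a connected piece of the dependency tree.

    `heads` maps each word to its head (None for the root). In a tree, a
    non-empty set of nodes is connected exactly when one and only one of
    its members has its head outside the set.
    """
    words = set(words)
    if not words:
        return False
    tops = [w for w in words if heads[w] not in words]
    return len(tops) == 1


# Assumed dependency analysis of "Parky pulled the strings".
heads = {
    "pulled": None,
    "Parky": "pulled",
    "strings": "pulled",
    "the": "strings",
}

print(is_catena(heads, {"pulled", "strings"}))  # the idiomatic parts
print(is_catena(heads, {"Parky", "the"}))       # not linked to each other
```

The idiomatic material pulled ... strings is a catena under this analysis even though it is not a complete constituent, which is the kind of expressivity that makes the notion attractive for representing MWEs.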

Koeva et al. (2018 [this volume]) is the contribution that looks at the highest 
number of different languages. The authors compare named entities in five dif- 
ferent languages from four language groups. The category of named entities is 
defined semantically as the names given to persons, locations, or organisations. 
The authors find that depending on the kind of named entity, a number of dif- 
ferent semantic aspects may be included within a larger name - such as a title 
for a person's name, for example. They use these semantic categories to define 
language-neutral, abstract patterns. In a second step, they map these to syntactic 
patterns for individual languages and identify similarities and differences within 
the sample of languages they consider. As in the case of Fotopoulou & Giouli 
(2018 [this volume]), sticking to a clearly defined and relatively well-studied se- 
mantic domain can provide a very good basis for comparing the variation that is 
found in the morpho-syntax of the MWEs used in this domain. 

Mititelu & Leseva (2018 [this volume]) consider a formal process, namely
derivation of MWE parts in Romanian and Bulgarian. They use the same method of data
sampling for the two languages: they extract MWEs from general dictionaries of 
idioms and collocations. Subsequently, they extract occurrences of these MWEs 
in corpora and classify the types of derivational morphology found in their data. 
The paper establishes that the productivity of MWEs in derivation is a general 
phenomenon that should be considered more systematically than it usually is. 
The use of two languages serves primarily two purposes: first, the authors can 
make a more general point than they could have when looking at just one lan- 
guage; second, they illustrate the fruitful applicability of their method across 
languages. 

The general, cross-lingual insights made by the contributions in this volume 
comprise at least the following: 


1. For well-defined and clearly understood semantic domains, it is possible 
to create a multi-lingual MWE sample. Once this semantically classified 
sample has been established, formal properties of the MWEs within the 
samples can be explored, including syntactic structure, flexibility, or
morphological aspects. In a next step, we can seek generalisations relating
these language-specific¹ formal properties to the language-neutral semantic
classification, both within and across the considered languages.


2. If there are comparable resources available (corpora, MWE collections,
treebanks, or more advanced natural language processing tools), the methods
of data sampling and data classification for MWEs can often be transferred
from one language to another. This means that we will be able to use the
same tools to study MWEs in one language and to carry out a parallel study
of MWEs in another language. It does not mean, however, that we perform a
comparison of MWEs in the two languages.


2.3 Special types of MWEs 


Given the heterogeneity of MWEs, it is necessary to focus on individual types of
MWEs. Recall that we defined MWEs here as complex expressions that show
some sort of idiosyncrasy. Consequently, MWEs differ in their basic linguistic
properties, but also in the types of idiosyncrasy they display. We have already
seen in §2.2 that the limitation to a particular type of MWE is a necessary step
for many cross-lingual considerations. In the present subsection, we will consider
special types of MWEs, based on their morphological or syntactic structure or
operations rather than on their semantics.

¹Throughout this chapter, we use language-specific or language-independent in the
sense of “specific to one language” or “independent of a particular language”,
rather than in the sense of “specific/independent of language as such”.

Focusing on special types of MWEs has been a useful method in any subdisci- 
pline of linguistics. Here is a somewhat arbitrary collection of references to illus- 
trate this point. To start with a negative example, the early Generative treatment 
of MWEs in Chomsky (1957) does not distinguish between MWEs with different
degrees of syntactic flexibility. This is the main reason for the validity of the
critique of this approach brought forward in Chafe (1968).

The importance of looking at different MWE types separately was illustrated, 
for example, in Krenn (2000) and Gibbs et al. (1989). Krenn (2000) shows that 
automatic MWE extraction from corpora may require different methods for dif- 
ferent types of MWEs. Gibbs et al. (1989) provide evidence that MWE types need 
to be carefully distinguished in psycholinguistic studies. Similarly, special
MWE types can be useful to address particular research questions: Hoeksema
(2010), for example, looks at MWEs containing embedded clauses such as (1), to
investigate how big a lexicalised linguistic unit can possibly be. Müller (1998)
looks at binomials as in (2) to show that general rules of coordination in German
interact with idiosyncratic lexical fillings in these constructions - in the
present example, the law of growing members in coordination.


(1) maken [dat X weg-kom-] ‘leave as soon as possible’
    We moeten maken dat  we weg-komen! (Dutch)
    we must   make  that we away-get
    ‘We need to leave!’


(2) fix  und fertig / *fertig und fix (German)
    fast and ready     ready  and fast
    ‘exhausted’


In this volume, three such special types of MWEs have been addressed in
some of the included chapters: MWEs and morphological derivation, patterns of
Named Entities, and Light Verb Constructions. We will briefly summarise these 
contributions. 

Mititelu & Leseva (2018 [this volume]) offer a rare contribution to the dis- 
cussion about the derivation of MWEs from MWEs. Of course, there is a lot 
of work on derivational morphology, but it does not pay extra attention to the 
productivity of idioms. Also, there is important work advocating that
morphological derivation and MWEs should be represented with the same machinery,
namely that of Constructions (Riehemann 2001). However, derivation phenomena
have hardly been explored within the domain of MWEs, although they are
widespread across languages. Below we use material from Mititelu & Leseva
(2018 [this volume]) and add some Modern Greek and Serbian data to illustrate 
the variations of the phenomenon. In (3), the pairs of noun MWEs in three lan- 
guages, namely Bulgarian, English and Modern Greek, can be analysed as stand- 
ing in a derivation relation. In (4) and (5), an adjective MWE and a verb MWE 
can be analysed as standing in a derivation relation (Modern Greek participles 
function as adjectives). Lastly, in (6) and (7), we have adjective MWEs of the sim- 
ile type that are derivationally related with verb MWEs (headed by de-adjectival 
verbs), again of the simile type in two languages, namely Modern Greek and 
Serbian. 


(3) a. moden dizayn - moden dizayner (Bulgarian)
    b. fashion design - fashion designer (English)
    c. syieóio moóas - syieóiastis moóas (Modern Greek)

(4) a. svalyam zvezdi (Bulgarian)
       take.down stars
       ‘to promise the moon’
    b. svalyach na zvezdi
       ‘one who promises the moon’

(5) a. pinao sa likos (Modern Greek)
       I.am.hungry like wolf
       ‘being very hungry’
    b. pinasmenos sa likos
       hungry like wolf
       ‘very hungry’

(6) a. kokinos san paparuna (Modern Greek)
       red as poppy
       ‘red (because of blushing)’
    b. kokinizo san paparuna
       I.become.red as poppy
       ‘blushing a lot’

(7) a. crven kao bulka (Serbian)
       red as poppy
       ‘red (because of blushing)’
    b. pocrveneo kao bulka
       I.become.red as poppy
       ‘blushing a lot’


Mititelu & Leseva (2018 [this volume]) map and contrast a wide range of deriva- 
tion types in Romanian and Bulgarian and, eventually, they reveal a rather com- 
plicated and promising field of study. 

As already discussed in §2.2, Koeva et al. (2018 [this volume]) offer a strongly
cross-linguistic account of the semantic and syntactic contexts where named
entities occur. Named entities have often been treated as MWEs (see, for
example, Downey et al. 2007; Vincze et al. 2011) - naturally, only the named
entities that are formed of more than one word are MWEs.

Named Entity Recognition is a widely discussed research topic in computational
linguistics. In this general context, Koeva et al., building on the fact that
named entities come in patterns in all languages, have set themselves the
ambitious goal of enumerating the semantic and syntactic contexts in which named
entities occur in
a set of languages, namely Bulgarian, English, French, Modern Greek, and Ser- 
bian. The authors study named entities denoting persons, locations and organisa- 
tions and show that the semantic patterns could be language independent, while 
the syntactic patterns vary to some degree according to language specificities 
such as the existence of articles and cases along with word order preferences. 

An impressive amount of literature has been dedicated to Light Verb
Constructions (LVCs). Some relatively early approaches include Jespersen (1965),
Gross (1998a), Butt (1995), Mel’čuk (1998). LVCs are structures that contain a verb that
combines with another verb or a predicative noun to yield a monoclausal struc- 
ture in which the event described is not specified by the (first) verb but by the 
other predicates. In a sense, the (first) verb is considered to have lost some of 
its semantic weight and to have turned into a "light" verb. In the example be- 
low, which has been taken from Laporte (2018 [this volume]), two translation 
equivalent expressions are given in French and English. In these examples, the 
main verb avoir/have is not used with its proper (possessive) semantics while the 
described event is specified by the noun conflit/conflict. Consequently, the verb 
avoir/have is used as a light verb in (8).


(8) a. Il a   eu  un conflit  avec sa  famille. (French)
       he has had a  conflict with his family
    b. He had a conflict with his family. (English)


LVCs occur in many languages and pose interesting questions about the the- 
ory of syntax and semantics. Not surprisingly, one question is how LVCs can be 
delineated from other types of verb MWEs and from compositional structures. 
Laporte (2018 [this volume]) offers a thorough discussion of the criteria used to 
set apart LVCs from other MWEs and from compositional structures. More on 
the descriptive side, Fotopoulou & Giouli (2018 [this volume]) include LVCs in 
their contrastive study of emotive MWEs in Modern Greek and French. 

The individual types of MWEs considered in this volume constitute a repre- 
sentative subset of options. First, the studies include some frequently discussed 
structures, such as LVCs, but also structures that often remain unnoticed, such 
as derivation. Second, they include the question of what the internal structure of 
an MWE is (in its unmodified form), but also which types of operations (parts of) 
it can undergo. The paper on Named Entities by Koeva et al. (2018 [this volume]) 
clearly addresses the first type of question, whereas the discussion of derivation 
by Mititelu & Leseva (2018 [this volume]) is concerned with the second type of 
question. Related to these points is the question of whether an MWE instanti- 
ates a general pattern of the language, such as an "ordinary" verb-complement 
relation, or whether we are dealing with a particular pattern that is productively, 
though exclusively, realised by MWEs, such as, maybe, some of the Named Entity 
patterns or the LVCs addressed in some of the papers. 

We are positive that the inclusion of MWEs in the linguistic discussion of par- 
ticular structures or phenomena can lead to important insights both in our un- 
derstanding of these phenomena and our understanding of MWEs. On the other 
hand, we consider it important to take a closer look at MWE-specific patterns and 
to identify in which way their properties relate to the more general phenomena 
of a language. 


3 MWEs and linguistic theory


MWEs are situated at the overlap of the lexicon and grammar. This places them
both at the centre and at the margins of linguistic theorizing. Theoretical
discussions of MWEs typically take one of the following two questions as their
starting point: Can the established tools of the lexicon or grammar be used to model
MWEs? What insights can we get on the properties of words or grammatical 
processes from looking at MWEs? The first question starts from a given theory 
and applies it to MWEs, the second starts from observations on MWEs and uses 
them to modify the theory. Some of the papers in this volume are written from a
particular theoretical perspective, including Generative Grammar (Kuiper's
contribution), Lexicon-Grammar (Laporte and Fotopoulou & Giouli), Lexical
Functional Grammar (Markantonatou & Samaridi), and Head-driven Phrase Structure
Grammar (Bargmann & Sailer). In the present section, we will give a brief
summary of the role MWEs have played in these theories and how the papers in this
volume relate to this. There are, of course, important discussions on MWEs in
many other frameworks, which we will have to leave aside here.²


3.1 Generative Grammar 


Generative Grammar is a cover term for a diverse family of theories going back 
to Chomsky (1957). Since we will look separately at two "spin-off" theories, Lex- 
ical Functional Grammar and Head-driven Phrase Structure Grammar, we will 
limit ourselves here to the theoretical strand that could be called Chomskyan 
Generative Grammar whose current version is referred to as Minimalism
(Chomsky 1995). In this tradition, the discussion of MWEs is very much focused on
idiomatic, verbal MWEs. Kuiper (2004) provides an overview of the main
developments in Generative Grammar and the role MWEs have played therein.
Nunberg et al. (1994) give a detailed and critical evaluation of the use of MWEs in
Generative syntactic argumentation. 

Ever since MWEs were first mentioned in Chomsky (1965), the general analytic
conception has been that an MWE is inserted into the syntactic derivation
as a single unit, though a unit with internal structure. An analytical
challenge arises once this assumption is combined with the idea that non-canonical
syntactic structures are derived from an underlying basic structure that is de- 
termined by argument selection, such as Deep Structure or the result of Merge. 
McCawley (1981) shows that these assumptions are incompatible with the data 
in (9): if the MWE pull strings is inserted as a unit, its parts cannot be spread 
over a relative clause and the noun it attaches to, as in (9a). If the head of the 
relative clause is generated inside the relative clause, (9a) would no longer be a 
problem, but, then, (9b) would be problematic, where the idiomatic noun strings 
is the head of a relative clause that does not contain the rest of the idiom. 


(9) a. The strings that Parky pulled to get me the job. (McCawley 1981: 135) 
b. Parky pulled the strings that got me the job. (McCawley 1981: 137) 


Only recently has the en bloc insertion approach to MWEs been relaxed in
some publications, such as Harley & Stone (2013) and Corver et al. (2016). Corver 


?See the relevant overview chapters in Burger et al. (2007) for some more frameworks. 
et al. integrate the distinction between decomposable and non-decomposable
MWEs from Nunberg et al. (1994) into a Minimalist approach and assume dis- 
tinct structural constraints for the two types of MWEs. 

In Generative Grammar, MWEs have typically been used to test structural 
hypotheses, where two aspects of MWEs have received primary attention: first, 
their restricted yet not fully blocked syntactic flexibility, and second, their inter- 
nal structure. For example, idioms provided a major piece of empirical evidence 
for the raising analysis in Government and Binding Theory (Chomsky 1986). As 
for the second point, over the years, the size of MWEs has often been taken 
as support for various syntactic notions: the perceived non-existence of MWEs
including subjects was used as support for the existence of a VP in syntax. More
recently, the size of MWEs has been claimed to correlate with phases, i.e.
structural domains that are assumed to be closed for a number of syntactic processes
(Svenonius 2005). 

In the present volume, Kuiper proposes an interesting new way of constructing 
syntactic arguments based on MWEs. Starting from the assumption that MWEs 
typically show some kind of irregularity, he formulates the following Law of 
Exception: 


Law of Exception: All formal properties of the grammar of a language are 
subject to exceptions manifested in idiosyncrasies in the lexical items of 
that language. 


This approach allows him to derive support for a principle of grammar by 
showing that there are lexical items violating it. 


3.2 Lexical Functional Grammar (LFG) 


The generative, transformation-free, phrase structure grammatical formalism of
Lexical Functional Grammar (LFG) is:


• Unification-based: information from the different components of an utterance
is unified to form the overall linguistic information content; the linear
order of the utterance components is not important.


• Lexicalistic: linguistic operations are divided into lexical and syntactic
operations. For instance, valency changing operations are understood as lexical
properties, while co-ordination is analysed as a syntactic phenomenon.
The syntactic component of the grammar cannot affect the lexical one.

LFG develops different levels of analysis that stand in a mutually constraining
relation to each other via well-defined mappings. Different formal means may
be employed for the representations at the various levels of analysis. These
levels include the m-structure, where morphological information is represented;
the c-structure, where phrasal structure information is represented with a tree
formalism; the f-structure, where functional relation information such as
agreement, binding, and control is represented using attribute-value matrices
(AVMs); and the s-structure, which is dedicated to semantic information. In
particular, crucial features of the f-structure are the so-called Grammatical
Functions (GFs) that stand for things like subject and object. LFG considers them
primitive notions and uses them to represent relations among the phrasal
constituents. MWEs were
first mentioned in the LFG literature when Bresnan (1982) used to keep tabs on 
somebody in order to construct an argument in favour of lexicalism. In this dis- 
cussion, tabs heads a meaningless NP that instantiates the object of the structure 
or the OBJ(ect) GF in LFG parlance. When the passivisation lexical rule applies, 
tabs becomes the SUBJ(ect) of the passivised form tabs were kept on somebody. 
An idiomatic verb predicate keep is defined in the lexicon (10) that requires a 
subject, an object and an indirect argument dubbed ON OBJ. Also an idiomatic 
noun tabs is defined to be “semantically empty”, which, in the LFG conception
of grammar, entails that the noun does not introduce a predicate in the represen- 
tation. Therefore, it does not have a PRED(icate) value and it only has a FORM 
value: 


(10) keep V (↑ TENSE) = PRES
            (↑ PRED) = ‘observe<SUBJ,ON OBJ>’
            (↑ ON OBJ FORM) =c TABS
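
The difference between a noun that carries a PRED value and one that only carries a FORM value can be illustrated with a small sketch. The Python code below is a toy model, not an LFG implementation: f-structures are represented as nested dictionaries, and the names get_path and satisfies_constraining_equation are hypothetical. The path to the idiomatic noun is passed in as a parameter; the example assumes, following the prose description, that tabs fills the OBJ function.

```python
# Toy model of f-structures as nested dicts.
# All names here are illustrative assumptions, not an existing LFG library.


def get_path(fstructure, path):
    """Follow a sequence of attribute names; return None if the path is undefined."""
    current = fstructure
    for attribute in path:
        if not isinstance(current, dict) or attribute not in current:
            return None
        current = current[attribute]
    return current


def satisfies_constraining_equation(fstructure, path, required_value):
    """A constraining equation only checks a value; it never adds one."""
    return get_path(fstructure, path) == required_value


# Semantically empty idiomatic noun: a FORM value, but no PRED.
tabs = {"FORM": "TABS"}
# Ordinary noun for comparison: it introduces a PRED and has no FORM.
notes = {"PRED": "notes"}

# Assumption: following the prose, the idiomatic noun fills the OBJ function.
idiomatic = {"PRED": "observe", "SUBJ": {"PRED": "Kim"}, "OBJ": tabs}
literal = {"PRED": "observe", "SUBJ": {"PRED": "Kim"}, "OBJ": notes}

print(satisfies_constraining_equation(idiomatic, ("OBJ", "FORM"), "TABS"))  # True
print(satisfies_constraining_equation(literal, ("OBJ", "FORM"), "TABS"))    # False
```

The check succeeds only when the required FORM value is already present in the f-structure, which is the sense in which the idiomatic verb demands its idiomatic noun.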


However, the semantically empty NP tabs should be prohibited from turning 
up as the object of other predicates. Andrews (1982) proposes a syntactic solu- 
tion that draws on a reformulation of the Coherence Principle of LFG. Kaplan 
& Bresnan (1995: 67) note that in order to face the problem posed by semanti- 
cally empty NPs “a separate condition of semantic completeness could easily be 
added to our grammaticality requirements, but such a restriction would be im- 
posed independently by a semantic translation procedure. A separate syntactic 
stipulation is therefore unnecessary.” Furthermore, Partee (2004: 158) points out 
that semantically empty NPs would be a problem for a Montague-like
compositional approach to semantics, where NPs are assumed to contribute a
predicate, but that a possible solution would require an indispensable semantic
translation level. This discussion highlights important aspects of the LFG
approach to MWEs. First, according to Bresnan (1982), MWEs that contain an
idiomatic V NP component and passivise can project an f-structure that contains
an OBJ(ect) GF. Therefore, it is assumed that the syntax of MWEs is exactly like the syntax
of compositional language in this respect, even when the fixed parts of an MWE 
are considered. The semantic component of the theory is expected to play an im- 
portant role. Actually, state-of-the-art LFG has put emphasis on semantics and 
has offered interesting analyses of "idiomatic" constructions, such as the way- 
construction (11) (Asudeh et al. 2008: 30). 


(11) Sarah elbowed her way quickly through the crowd. 


It seems that the development of a semantic component calls into question the
traditional conception of the syntactic component of the theory, for instance the
so-called semantic forms (Lowe 2015). Semantic forms have been crucial for
defining the coherence and completeness axioms of LFG that form a major part of
the mechanism by which the grammar checks the grammaticality of strings. In a 
similar vein, several of the defining properties of Grammatical Functions reflect 
a set of behaviours that were considered syntactic, but now a lot of this burden
may move to semantics. For instance, the conditions on the replacement of an NP
by a clitic may be semantic in nature to some considerable degree (Arnold 2015).
So far, passivisation has not received a semantic analysis in LFG and remains,
so to speak, the identifier of “syntactic OBJecthoodness”; therefore, a discussion
about the ability of MWEs to passivise could still be argued to be a syntactic
discussion.

An applied approach to MWEs, embedded in an implemented wide-coverage LFG
grammar of Arabic, was offered by Attia (2006). Attia has argued
that MWEs should be disambiguated in a preprocessing step, i.e. before parsing.
In his system, fixed and semi-fixed MWEs are processed by the morphological 
component that uses regular grammars (as opposed to the syntactic component 
that uses context-free grammars). 

Such approaches open up theoretical issues, such as which part of speech
should be assigned to the fixed parts that are treated as words and, given the
problems stemming from passivisation, how the LFG syntactic theory is affected
by these novel words and their syntactic reflexes. These issues are a potential
challenge for the generally accepted view that MWEs and compositional structures
use exactly the same syntax. Markantonatou & Samaridi (2018 [this volume])
discuss exactly this question in the framework of LFG drawing on Modern Greek
verb MWEs.


3.3 Head-driven Phrase Structure Grammar (HPSG)


Head-driven Phrase Structure Grammar (HPSG) has its origin in phrase structure
grammar frameworks such as Generalized Phrase Structure Grammar (Gazdar et 
al. 1985), but has received a fundamentally different formal basis as a constraint- 
based feature structure grammar (Pollard 1999; Richter 2004). HPSG encodes 
all levels of linguistic analysis within one representation, a well-articulated no- 
tion of a Saussurian linguistic sign. The lexicon is of central importance for the 
theory, as all idiosyncratic information projects from the lexicon, and valence- 
alternation processes are expressed as lexical rules. The role of syntax is largely 
restricted to allowing lexical elements to combine in order to satisfy their va- 
lence requirements and, at the same time, to build up the phonological and the 
semantic representations of a sentence. All grammar rules are strictly local, i.e., 
referring only to a mother node and its immediate daughters. Since Sag (1997),
a proliferation of grammar rules can be observed, reflecting an attempt to
connect HPSG and Construction Grammar more closely. This ultimately led to
the development of Sign-Based Construction Grammar (SBCG, Sag 2012).

There is no treatment of idioms or MWEs in Pollard & Sag (1987) or Pollard & 
Sag (1994), but at least since Krenn & Erbach (1994), there have been attempts
to encode MWEs in HPSG. A basic obstacle to this task comes with the
formalisation of HPSG: every node in a syntactic tree must be licensed by the grammar.
This blocks every attempt to integrate idiosyncratic phrasal expressions as units. 
For this reason, HPSG researchers tend to promote lexical analyses of MWEs. 
This has been done in Krenn & Erbach (1994), who use the highly expressive 
selection mechanism of HPSG to account for the co-occurrence of idiom parts. 
The sign-based character of HPSG allows selection not only for syntactic cate- 
gory and semantic type, but for fine-grained syntactic and semantic properties 
as well, including the selection of a single lexeme. The discussion on MWEs has 
motivated a number of innovations in the theory, such as the use of underspec- 
ification in semantics, the introduction of lexeme identifiers in syntax, and the 
accessibility of specifiers of phrases from a higher node. 

Many HPSG publications on MWEs have been written in the context of machine
translation projects, including Krenn & Erbach (1994), Copestake et
al. (1995) and Sag et al. (2002). An important drawback of the HPSG research
on idioms is that it is almost exclusively restricted to the discussion of English 
and German examples, though there are recent approaches to Hebrew (Herzig 
Sheinfux et al. 2015) and Japanese (Haugereid & Bond 2011). 

Recent approaches, such as Kay & Sag (2014) and Bargmann & Sailer (2018 
[this volume]), propose such a lexical analysis for all idioms that have a regular
syntactic shape. This generates a number of research questions: (i) Can the
idiomatic reading be derived using the regular mechanism of semantic combina- 
torics? (ii) Can the attested differences in syntactic flexibility between idioms be 
captured? (iii) Can the co-occurrence of idiom parts be guaranteed and an idiom- 
external use of idiom components be blocked? In addition, syntactically irregular 
expressions still need to be captured by idiosyncratic grammar rules, and it is far
from certain that the required rules satisfy HPSG's locality restriction that
idiosyncrasy can only occur in local mother-daughter relations (see Sailer 2012). Given
the constraint-based local nature of HPSG, the answer to (i) can only be
positive. However, mechanisms of semantic combinatorics have been proposed that
are not compatible with standard, Montagovian, assumptions of compositional- 
ity, including underspecification and redundant semantic marking. Answers to 
question (ii) typically attempt to show that what seem to be idiosyncratic restric- 
tions on the syntactic flexibility of MWEs follow from the general properties of 
the considered syntactic processes and the lexical properties of the stipulated 
idiomatic words. Kay & Sag (2014) and Bargmann & Sailer (2018 [this volume]) 
illustrate this strategy, which can be seen as a variant of the above-mentioned hy- 
pothesis in Nunberg et al. (1994) that the decomposability of an MWE is directly 
connected to its syntactic flexibility. As for question (iii), the selection mecha- 
nism is still the most popular means of ensuring the co-occurrence of idiom parts, 
while more flexible collocation mechanisms have been proposed as well (Sailer 
2004; Soehn 2009). The final question of the analysis of syntactically irregular 
expressions and, in particular, the possible depth of syntactic idiosyncrasy has 
not been addressed systematically. Richter & Sailer (2014) and Kay & Sag (2014) 
look at MWEs with embedded clauses (such as know on which side one's bread is 
buttered), but all expressions they consider are syntactically regular. 


3.4 Lexicon Grammar (LG) 


A lot of pioneering and on-going work on MWEs has the Lexicon-Grammar (LG) 
framework as a reference point. LG is not a generative grammatical framework; 
rather, it strongly advocates a classification-based approach. LG relies on the 
classification of a large number of linguistic structures, taking as its linguistic
unit not the word but the simple sentence, which consists of a verb, its subject and
at most two objects (Gross 1982a). Various structures are identified and used as
classificatory properties for verbs, for instance the simple transitive active voice 
phrase (NP V NP for English) and the simple passive phrase (NP be Ved by NP for 
English) are listed as independent properties (and not as an ordered pair defin- 
ing a transformation) that verb predicates such as write and die may or may not 
have, depending on whether the corresponding structures are attested. Matrices
are developed for each class of the verbs that demonstrate similar behaviour
with respect to these properties. The columns in the matrices are named for the 
properties and the rows for the various verb predicates; the symbols ‘+’/‘−’ are
assigned to the cells depending on whether the predicate is found in the respec- 
tive structure or not. 
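
The layout of such a matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two properties are the ones named above, the verb read and all cell values are invented for the example and are not taken from any actual LG table, and grouping verbs by identical rows is a simplification of "similar behaviour".

```python
from collections import defaultdict

# Columns: structures used as classificatory properties.
PROPERTIES = ["NP V NP", "NP be Ved by NP"]

# Rows: verb predicates. Cell values are toy judgements for illustration only;
# the verb "read" is an addition not found in the text.
TABLE = {
    "write": {"NP V NP": True, "NP be Ved by NP": True},
    "read": {"NP V NP": True, "NP be Ved by NP": True},
    "die": {"NP V NP": False, "NP be Ved by NP": False},
}


def row_symbols(verb, table=TABLE, properties=PROPERTIES):
    """Return the '+'/'-' symbols of one row, in column order."""
    return tuple("+" if table[verb][prop] else "-" for prop in properties)


def group_by_profile(table=TABLE, properties=PROPERTIES):
    """Group verbs whose rows are identical (a simplified notion of a class)."""
    classes = defaultdict(list)
    for verb in table:
        classes[row_symbols(verb, table, properties)].append(verb)
    return dict(classes)


for verb in TABLE:
    print(f"{verb:<6}", " ".join(row_symbols(verb)))
```

Because the passive is listed as a column in its own right, nothing in the table relates it to the active by a transformation; the two properties are simply recorded independently for each verb.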

Classifications rely on empirically attested phenomena on the morphological 
and the syntactic level. Still, meaning seems to retain an important role in the 
definition of verb classes. For instance, Gross (1975: 401–402) explains that the
verb dire appears in structures that are not available to other verbs of "saying" 
and then notes: 


One might describe these restrictions by means of a standard transforma- 
tional solution: the syntactic properties that have been observed for the 
verbs of /saying/ would only be attributed to the verb dire. All other verbs 
of /saying/, namely all verbs that indicate an emission of sound or of light, 
would be considered as intransitive verbs. (Gross 1975: 402) 


In other words, because certain syntactic properties are observed with only 
one verb, namely with the representative verb dire, but not with the other verbs of 
the same class, a transformation is assumed that relates the syntactic properties 
of the representative verb dire with the syntactic properties of other members of 
the class; therefore, the verbs of "saying" do not share exactly the same syntactic
properties, and they belong to the same class because they share the "emission" 
semantics and some syntactic properties. Things being so, the class is defined not 
by the morphosyntactic properties of its members but by their meaning. 

Gross (1977) explains that LG assigns extreme importance to taxonomies be- 
cause they pertain to the scientific nature of the linguistic quest. Taxonomies 
are a standard practice in biology whereby the use of the best representative 
of a species in experiments guarantees reproducibility of results. In the case of 
linguistics, the acceptability tests of linguistic structures by native speakers are 
considered experiments. Gross (1978) discusses the drawbacks of classification 
practices, namely that they result in disjoint classes of classified objects while 
in linguistic reality few clear-cut separating lines are observed. Still, he argues 
that it is worth paying the price of (probably vast) fragmentation into (not
necessarily homogeneous) classes because this is the only known way of obtaining
an organisation of linguistic data that guarantees reproducibility of linguistic 
experiments. 

Early on, LG applied taxonomies to MWEs; LG prefers the term fixed
expressions for MWEs (Gross 1982b; 1998a,b). The continuum from fully
compositional structures to fully fixed expressions is recognised. The criteria developed
set fixed expressions apart from terminology and professional or other
sublanguages, from frequently used compositional structures and from “support
constructions” (in §2.3 we encountered these constructions under the name “LVCs”).
In this volume, the paper on Modern Greek and French emotive MWEs by
Fotopoulou & Giouli studies a set of structures, including MWEs and support verb
constructions, that denote emotions. These structures illustrate the
aforementioned continuum between compositional and fixed language. The study also
yields an interesting cross-lingual result, namely that the degree of fixedness is
related to the intensity of the emotion denoted.

MWE studies owe a lot to work conducted within LG. Laporte (2018 [this vol- 
ume]) summarizes some of the work done on MWEs within LG, elaborates on 
its merits and compares the strongly data-based method of LG with the more 
hypothesis-driven approach of Generative Grammar. 


4 What do we find important from here on? 


All frameworks that are represented in this volume take a competence-oriented 
approach to MWEs, i.e., they attempt to model the possibilities rather than the 
usage of MWEs. However, with MWEs in particular, it is rather difficult to draw
the line between what is a grammatically acceptable variation of an MWE and 
what is a variation that is licensed by some special rhetoric effect such as word 
play or what Egan (2008) calls "extended" uses of MWEs. Related to this, the
insights of the rich literature on the discourse-constitutive effect of MWEs remain
largely unexplored in the formal study of MWEs. This has direct repercussions
on the formal modelling. First, most competence-oriented researchers agree that 
playful use of MWEs falls outside their empirical domain. They do not, however,
necessarily agree on whether a particular lexical or structural variation of an 
MWE is an extended use or not. This has an influence on what set of data they 
aim to explain. While this is a general problem of competence-based approaches, 
it is particularly prominent in the study of MWEs. In the present volume, this 
contrast can be seen most clearly in the differences between the contributions 
by Laporte and Bargmann & Sailer: Laporte's data are based on simple sentence 
frames without context, whereas Bargmann & Sailer consider all MWE variations 
that they find in attested examples, taking into account their linguistic context 
though not the question of whether such examples would be considered rather 
unnatural by native speakers, even in the given context. 

A second aspect that is usually left aside in the included frameworks is the
inherent ambiguity of many MWEs. Given that idiomaticity, i.e., the presence of 
"compositional" or "literal" meaning next to an idiomatic meaning, is one of the 
three defining prototypical properties of MWEs, this is clearly a question that 
would deserve attention. Particularly intriguing are cases in which the idiomatic 
and the literal reading seem to be simultaneously present, as in (12), taken from 
Ernst (1981). 


(12) He bit his thirst-swollen tongue. 
Reading: ‘He bit his tongue & his tongue was thirst-swollen.’


Recent attempts to combine compositional and distributional semantics such 
as Gehrke & McNally (2016) can be considered a step towards a modelling of such 
co-existences of a literal and an idiomatic meaning. 

While there are the above-mentioned similarities between the frameworks rep- 
resented in this volume, there are also considerable differences. One major differ- 
ence is the relative importance that they attribute to theoretical concepts and to 
data. Research in Generative Grammar is typically hypothesis-driven. This has
led to many hypotheses about MWEs. While most of them have been proven 
wrong by now, they were still useful in putting the focus of phraseological re- 
search on a particular aspect and led to an increase of knowledge in this domain. 
Lexicon Grammar, on the other hand, is rather data-driven and has a relatively
long tradition of systematic compilation and classification of data. While this led 
to the creation of rich resources on MWEs, it is less obvious which implications 
generalisations over the collected data should have for the theory. We have also 
seen how lexicalist theories such as LFG and HPSG attempt to develop tools to 
account for the more phrasal phenomena that we find in MWEs. The research 
questions in such frameworks are typically very specific and partly data-driven, 
partly hypothesis-driven. 

The variety of analytic and methodological alternatives used in the theoretical 
descriptions of MWEs over the years is impressive and shows that this empirical 
domain has a lot to offer for theoretical linguistic research. We would be excited 
if the present volume stimulated more interaction and mutual reception across 
framework boundaries. There are still many types of MWEs that have not been 
described formally or for which no data have yet been collected systematically. 
Such studies can potentially corroborate or refute essential properties of a frame- 
work or at least motivate a small change in perspective. 


Chapter 1


The syntactic flexibility of semantically 
non-decomposable idioms 


Sascha Bargmann 
Goethe University Frankfurt/Main, Germany 


Manfred Sailer 
Goethe University Frankfurt/Main, Germany 


Nunberg et al. (1994) caused a shift in perspective from a monolithic view of all 
idioms towards a word-level approach for semantically decomposable idioms. We 
take that idea one step further and argue that a semantically non-decomposable 
idiom of syntactically regular shape can also be analyzed in terms of individual 
word-level lexical entries. We suggest that these entries combine according to the 
standard rules of syntax and that the restrictions on the syntactic flexibility of a 
semantically non-decomposable idiom follow exclusively from the interaction of 
the special semantics of these entries with the semantic and pragmatic constraints 
of the relevant syntactic constructions in a particular language. In our analysis, the 
words constituting a non-decomposable idiom make partially identical semantic 
contributions. We formulate our analysis in Lexical Resource Semantics (Richter 
& Sailer 2004). 


1 Introduction 


In this paper, we make a theoretical point for loosening the close ties that Nun- 
berg et al. (1994) claim exist between the semantic decomposability and the syn- 
tactic structure of idioms. We argue for a more uniform syntactic treatment of 
idioms within and across languages, saying that semantically non-decomposable 
idioms (henceforth abbreviated as SNDIs) like kick the bucket can and should be 
analyzed as consisting of individual word-level lexical entries that combine
according to the standard rules of syntax and contribute a piece of the meaning of
the idiom. 

We mainly base our case on the contrast between English and German when it 
comes to verb placement, constituent fronting, and passivization (§2 and §3). Our
findings suggest that the differences in the syntactic flexibility of idioms might 
be due to differences among the semantic and pragmatic constraints that hold 
for the involved syntactic constructions in a particular language, rather than to 
differences in the syntactic encoding of the idioms themselves. 

The central aspect of our analysis (§4) is that SNDIs are syntactically analyzed
as combinations of individual words, and that these words can make identical 
semantic contributions to the overall meaning of the idiom. We formulate our 
analysis in Lexical Resource Semantics (Richter & Sailer 2004). 

Before we conclude the paper (§6), we give a short outlook on the behavior
of SNDIs in Estonian and French (§5), which provides further evidence for our
argument. 


2 Some data and a former approach 


In this section, we will describe the behavior and architecture of SNDIs as per- 
ceived by Nunberg et al. (1994). We will look at their analysis of English data and 
challenging data from (mostly) German. 


2.1 English SNDIs in Nunberg, Sag & Wasow (1994) 


Nunberg et al. (1994), henceforth NSW, divide English idioms into two categories: 
Idiomatically Combining Expressions (ICEs) and Idiomatic Phrases (IPs). 

ICEs, exemplified here by pull strings, consist of individual word-level lexical 
entries (pull and strings), each of which contributes a piece of the meaning of the 
idiom as a whole (pull ~ ‘use’ and strings ~ ‘connections’).

IPs, exemplified here by kick the bucket, are syntactically and semantically 
monolithic, i.e. the phrase as a whole is stored in the lexicon and coupled with 
the overall idiomatic meaning (kick the bucket = ‘die’). In other words: NSW do 
not assume the meaning of an IP to be distributed over individual parts, as there 
are none in their opinion, not even in those cases where a division into syntactic 
constituents seems highly plausible because the idiom appears to have a regular 
syntactic structure (as is the case with kick the bucket). 

NSW base this bifold classification on the empirical observation that many 
English idioms (those that they then categorize as ICEs) are syntactically flexible 
to a certain degree, whereas some others (those that they then categorize as IPs)
seem to be syntactically frozen. None of the sentences in (1) can normally be 
understood in the idiomatic sense. 


(1) a. * Alex kicked the cruel bucket. (additional adjective) 
b. * Alex kicked a bucket. (determiner variation) 
c. * The bucket (that) Alex kicked was cruel. (restrictive relative clause) 
d. * The bucket was kicked. (passive) 
e. * The bucket, Alex kicked. (NP-fronting) 
f. * It was the bucket that Alex kicked. (it-cleft) 
g. * What bucket did Alex kick? (wh-interrogative) 


According to NSW, it is the syntactic monolithicity of IPs that explains their 
non-compatibility with the syntactic constructions in (1). All the parts of an IP 
must be given in the exact same linear sequence provided by its phrasal lexical 
entry. Any disruption of that sequence results in ungrammaticality. 
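
NSW's claim, as summarised here, can be made concrete with a minimal Python sketch. It is an illustration of the claim only, not NSW's formalisation: the phrasal entry is modelled as a tuple of lemmas, the input sentences are assumed to be lemmatised by hand, and the names IP_ENTRY and allows_idiomatic_reading are hypothetical.

```python
# Lemma sequence stored in the (hypothetical) phrasal lexical entry.
IP_ENTRY = ("kick", "the", "bucket")


def allows_idiomatic_reading(lemmas, entry=IP_ENTRY):
    """True only if the entry occurs as an uninterrupted, identically ordered run."""
    width = len(entry)
    return any(
        tuple(lemmas[start:start + width]) == entry
        for start in range(len(lemmas) - width + 1)
    )


# Hand-lemmatised inputs (an assumption of this sketch).
print(allows_idiomatic_reading(["Alex", "kick", "the", "bucket"]))           # True
print(allows_idiomatic_reading(["Alex", "kick", "the", "cruel", "bucket"]))  # False, cf. (1a)
print(allows_idiomatic_reading(["the", "bucket", "be", "kick"]))             # False, cf. (1d)
```

Working on lemmas rather than word forms is a simplification that lets the verb inflect freely while every other change to the sequence blocks the idiomatic reading.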

This syntactic monolithicity of IPs, they say, stems from their meaning not 
being distributed over individual parts. ICEs like pull strings, on the other hand, 
allow for variations that affect the meaning of their individual components. For 
example, the meaning of the complement-NP's head noun can be restrictively 
modified or quantified over. IPs, in contrast, do not allow for any of these
semantic operations, which is the reason for the ungrammaticality of (1a)–(1c).

All things considered, NSW observe a strong correlation between the semantic 
non-decomposability and the syntactic fixedness of IPs, which induces them to 
conclude that there exists a conditional dependency between the two. If an idiom 
is semantically non-decomposable, so they argue, it is syntactically fixed and 
hence to be analyzed in terms of a phrasal lexical entry, i.e. a monolithic syntactic
block.


2.2 Challenging data for Nunberg, Sag & Wasow (1994) 


NSW discuss the observations made for German in earlier versions of Schenk 
(1995) and Webelhuth & Ackerman (1999) that SNDIs like den Löffel abgeben ‘die’
(lit.: ‘pass on the spoon’) or ins Gras beißen ‘die’ (lit.: ‘bite in the grass’) can
undergo syntactic processes. These include the dislocation of the finite verb to the
second position (V2), see (2), and the dislocation of idiom chunks to the initial
position (the Vorfeld), see (3a). The example in (3a) is taken from Trotzke & Zwart
(2014: 138), example (3b) is a corpus example.¹


(2) Dann gab    Alex den Löffel ab.
    then passed Alex the spoon  on
    ‘Then Alex died.’


(3) a. Den Löffel hat er ab-gegeben.
       the spoon  has he on-passed
       ‘He died.’

    b. Den Löffel habe er noch  nicht ab-geben wollen, ...
       the spoon  has  he still not   on-pass  want
       ‘He didn't want to die yet, ...’


NSW briefly explore a purely linearization-based/phonological explanation of 
data like those in (2). However, SNDIs also allow for passivization, see (4), a 
syntactic operation that cannot be analyzed as a simple word-order alternation, 
as it involves adding, inflecting, and often also deleting material. 


(4) Hier wurde der Löffel ab-gegeben.
    here was   the spoon  on-passed
    ‘Someone died here.’


These data suggest that an IP-like analysis is less attractive for German than 
for English, as there seem to be no syntactic restrictions in German that correlate 
with semantic non-decomposability.³

It is worth noting that English SNDIs are not necessarily fully fixed either. We 
will list three commonly mentioned types of data that support this (see, for exam- 
ple, Baldwin & Kim 2010) and add a fourth one. First, many English SNDIs have 
the same syntactic structure as any regular English V-NP combination, which 
sets SNDIs apart from syntactically irregular expressions like kingdom come
‘paradise’. Second, English SNDIs show full morphological flexibility on their verbal
heads, see (5). 


¹We will not provide a full morphological glossing for German, but only indicate the parts that
are relevant for the discussion at hand.

²IDS corpora: N92/JAN.03243 Salzburger Nachrichten, 28.01.1992

³Soehn (2006) pursues an IP-analysis of German SNDIs. He accounts for the data in (2) and (4)
by his formulation of quite abstract phrasal lexical entries that leave many syntactic relations 
underspecified. A disadvantage of this account is that the lexical representation of SNDIs dif- 
fers dramatically from language to language, even for syntactically very similar idioms, such 
as those consisting of a verb and a direct object. Müller (2013b: 923) argues that an analysis 
that reflects cross-linguistic parallelism is generally to be preferred over one that does not. 
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(5) a. Alex kicks/kicked the bucket. 


b. Kim's kicking the bucket caused great concern. 
Third, SNDIs allow for certain modifiers within the complement-NP, see (6).* 
(6) Alex kicked the political/proverbial/goddamn/golden bucket. 
Fourth, we even find passive examples of kick the bucket, see (7). 


(7) When you are dead, you don't have to worry about death anymore. ... The
    bucket will be kicked.⁵


We will turn to such examples in §3.2. For the moment, it suffices to show
that the postulated causal relation between semantic non-decomposability and 
syntactic fixedness loses much of its appeal in the light of these data. 

We conclude that semantic non-decomposability and syntactic fixedness are 
not necessarily mutually dependent, i.e. an SNDI can show syntactic flexibility. 
This is rather obvious in German, but there are also some indications for English. 


3 Construction-specific restrictions 


In this section, we will look at German and English and point out the differences 
between these two closely-related languages when it comes to verb placement, 
constituent fronting, and the passive voice. 


3.1 German 

We will now go through the three mentioned syntactic processes in German and 
show that they impose no (or rather weak) semantic or pragmatic restrictions. 

3.1.1 V2-movement


In German, the position of the finite verb determines the clause type. In declara- 
tive main clauses, for example, the finite verb occurs in second position (V2), see 
(8a). In subordinate clauses, it typically occurs in final position (V-final), see (8b). 


⁴Semantically, however, none of these modifiers seems to apply to the meaning of idiomatic
bucket. For suggestions on how these additional adjectives should be interpreted, see Ernst
(1981) and Potts (2005), among others.

⁵The Single Man by John Paschal & Mark Louis. 2000. Lincoln, NE: iUniverse. Page 195.

(8) a. Alex hat gestern einen Freund mit-gebracht.
       Alex has yesterday a friend along-brought
       ‘Alex brought along a friend yesterday.’
    b. dass Alex gestern einen Freund mit-gebracht hat
       that Alex yesterday a friend along-brought has
       ‘that Alex brought along a friend yesterday’


V-final is taken to be the basic position. V2 is taken to be derived. The dis- 
location of the finite verb from V-final to V2 is commonly referred to as V2- 
movement. There are only very few restrictions as to what verbs may occur in 
V2. All of these restrictions are either morphological or syntactic, never seman- 
tic or pragmatic (Schenk 1995: 262-263). As already mentioned, the fronted verb 
must be finite, compare (8a) above with (9). 


(9) * Alex mit-gebracht gestern einen Freund hat. 
Alex along-brought yesterday a friend has 


If the fronted verb is a particle verb, the particle cannot be fronted together 
with the verb, see (10a) and (10b). 


⁶As pointed out to us by a reviewer, Haider (1997: 24) presents the example in (i.a) and suggests
that some operators require the verb to be in final position to be in their semantic scope. This
could be interpreted as a scopal effect of V2-movement, but Meinunger (2001) shows convincingly
that the data should be analyzed as a syntactic ban on stranding these operators rather
than as a semantic effect of V2-movement.

(i) a. Der Wert hat sich weit mehr als bloß verdreifacht.
       the value has itself far more than merely tripled
       ‘The value has far more than merely tripled.’
    b. * Der Wert verdreifachte sich weit mehr als bloß.

⁷We are grateful to a reviewer for bringing up data in which a particle immediately precedes
a fronted finite verb, see the example in (i) taken from Müller (2005: 14), and which could
therefore be mistaken for counterexamples to the generalization stated above. As Müller (2005)
shows, however, these data are best analyzed with the particle inside the Vorfeld and, therefore,
are compatible with the generalization.

(i) ... gut klar komm ich nicht.
    good clear come I not
    ‘... I am not coping well.’

(10) a. Alex bringt morgen einen Freund mit.
        Alex brings tomorrow a friend along
        ‘Alex will bring along a friend tomorrow.’
     b. * Alex mit-bringt morgen einen Freund.
          Alex along-brings tomorrow a friend


3.1.2 Vorfeld placement 


In a number of German clause types, including declarative main clauses, the 
fronted verb is preceded by a constituent. This constituent appears in the so- 
called Vorfeld “prefield”. Frey (2006) argues that there are three ways that a con- 
stituent can end up in the Vorfeld. 


1. Formal movement: The Vorfeld-constituent has the same intonational and
   pragmatic properties that it would have at the beginning of a V-final clause.
   This covers pragmatically unmarked subjects, including expletives as in
   (11a) and (11b), as well as aboutness topics. Formal movement is clause-bounded.

2. Base generation: This option is available for a small number of adverbials
   only. The Vorfeld-es in (11c) probably falls into this class.

3. Ā-movement: The Vorfeld-constituent is moved from one of a variety of positions.
   This movement is potentially unbounded. The moved constituent
   is stressed and receives a contrastive interpretation.


The Vorfeld-constituent can be of any syntactic category and grammatical func- 
tion. Examples (11a) and (11b) illustrate that it can also be an expletive, i.e. it need 
not make an independent semantic contribution. Even the Vorfeld-es, an exple- 
tive that is not even a dependent of the clause, is allowed, see (11c) from Müller 
(2013a: 174). 


(11) a. Es hat geregnet.
        it has rained
        ‘It rained.’
     b. Es scheint, dass Alex schläft.
        it seems that Alex sleeps
     c. i. Es kamen drei Männer herein.
           it came three men in
           ‘Three men came in.’
        ii. dass (*es) drei Männer herein-kamen
            that (it) three men in-came


Fanselow (2004) argues that German allows for what he calls pars-pro-toto 
movement, where only part of a contrastively interpreted constituent is moved 
into the Vorfeld. He provides the example in (12) (Fanselow 2004: 12) and argues 
that the question can equally well be answered by (12a) or (12b). In either case, 
the focus is on both the dative object and the verb, even though in (12a) it is only 
the dative object that occurs in the Vorfeld. 


(12) Was ist mit dem Buch passiert? ‘What happened to the book?’
     a. Meiner FREUNDIN hab ich 's geschenkt.
        my.DAT girlfriend have I it given
        ‘I gave it to my girlfriend as a present.’
     b. [Meiner Freundin geschenkt] hab ich's.


3.1.3 Passive 


Just like V2-movement and Vorfeld-placement, passivization has no effect on the 
truth conditions of a sentence. In contrast to the previous two, however, the pas- 
sive does not mark the clause type. In German, just as in English, verbs that 
take an accusative complement usually passivize. The complement becomes the 
subject, and the subject becomes an optional oblique complement, see (13). In 
contrast to English, however, German also allows for the passivization of intran- 
sitive verbs, see (14a), and of verbs that take non-accusative complements, see 
(14b). All of these examples are taken from Müller (2013a: 287-288). 


(13) Karl öffnet das Fenster. → Das Fenster wird (von Karl) geöffnet.
     Karl opens the window      the window is (by Karl) opened
     ‘Karl is opening the window.’ → ‘The window is being opened (by Karl).’


(14) a. Hier wird getanzt.
        here is danced
        ‘People are dancing here.’
     b. Dem Mann wird geholfen.
        the.DAT man is helped
        ‘The man is being helped.’


In German, passivization is only possible for verbs that have a referential sub- 
ject. Consequently, verbs with an expletive subject, see (15) from Müller (2013a: 
293), or no subject at all, see (16) from Müller (2013a: 295), do not passivize. 


(15) * Heute wurde geregnet.
       today was rained

(16) a. Dem Student graut vor der Prüfung.
        the.DAT student is.terrified of the.DAT exam
        ‘The student is terrified by the exam.’
     b. * Dem Student wird (vom Professor) vor der Prüfung gegraut.
          the.DAT student is (by.the professor) of the.DAT exam terrified


Müller (2013a: 289) provides the example in (17) to show that unaccusative 
verbs usually do not passivize.⁸


(17) Der Zug kam an. → * Hier wurde angekommen.
     the train came on    here was arrived
     ‘The train arrived.’


Overall, we follow Müller (2013a) and describe the German passive as demo- 
tion of a referential subject. 


⁸In those cases where unaccusative verbs do passivize, a special pragmatic effect is achieved.
Müller (2013a: 305) illustrates this point with the example in (i), which can be used to express
a generally valid rule.

(i) Hier wird nicht an-gekommen, sondern nur ab-gefahren.
    here is not on-come but only away-driven
    ‘One doesn't arrive here but only depart.’

This special pragmatic effect makes passivization possible in cases that otherwise seem
completely out, such as with haben ‘have’:

(ii) Hier wird keine Angst gehabt.
     here is no fear had
     ‘Nobody is afraid here. / You'd better not be afraid!’


3.2 English


We will now turn to parallel constructions in English and show that there are 
far stronger restrictions on fronted elements in English than in German. V2-like 
verb movement in English is restricted to auxiliaries. Since we do not know of
any English SNDIs with an auxiliary, we will leave verb movement aside and
focus on topicalization and passivization.⁹


3.2.1 Topicalization 


Topicalization is illustrated in (18) from Ward & Birner (1994: 5). 


(18) GW: Have you finished the article yet? 
MR: The conclusion I still have to do. 


Ward & Birner (1994) argue that, in English, one of the requirements of topicali- 
zation is that the meaning of the fronted constituent be (linked to) discourse-old 
information. 

Contrary to German, English also lacks pars-pro-toto fronting. The English 
equivalent of (12a) is not a felicitous answer to a question like What happened to 
the book? because the fronted constituent is not linked to the previous context
and English does not allow the fronted constituent to be interpreted just as a "pars" to
a larger "toto" that would include the verb.


(19) What happened to the book? # To my girlfriend, I gave it. 


Yet another observation is important for our purpose. Reflexive pronouns can 
only be fronted if they are used contrastively, as in (20a). The reflexive comple- 
ment of an inherently-reflexive predicate such as perjure cannot be used to mark 
a contrast. Consequently, it cannot be fronted, see (20b). 


(20) a. Herself Alex watched in the mirror, not Chris.
     b. * Herself Alex perjured.


⁹Another potentially relevant construction is locative inversion, see (i). It involves a fronted
non-subject and a verb that precedes the subject: 


(i) Beneath the chin lap of the helmet sprouted black whiskers. (Ward & Birner 1994: 7) 


Just as for subject-auxiliary inversion, there are very strong restrictions on the type of verb 
that may occur in this construction. In addition, there are strong discourse requirements. Again, 
we did not find an SNDI that would be a candidate for this construction, which is why we will
not take it into consideration here.

We will interpret this as an indication that a topicalized constituent needs to
make an independent contribution to the clause in which it is contained.¹⁰


3.2.2 Passivization 


Kuno & Takami (2004: 127) argue that subjects of English passives are topics. 
Consequently, they need to be able to refer to entities in the discourse, ideally 
to entities that are either introduced in the previous discourse or can be inferred 
from it. Ward & Birner (2004) characterize passive subjects as being relatively 
discourse-old, i.e. at least not the discourse-newest element in the clause. 

Kay & Sag (2014) provide the examples in (21) to show that expletives can occur 
as subjects of passive sentences. 


(21) a. There was believed to be another worker at the site besides the neighbors
        who witnessed the incident.
     b. It was rumored that Great Britain, in apparent violation of the terms of
        the Clayton-Bulwer treaty, had taken possession of certain islands in the
        Bay of Honduras.


If expletives have an empty semantics, this would contradict the observations 


¹⁰A reviewer points out that fronting reflexive arguments of inherently-reflexive verbs is highly
restricted in German as well. A bare reflexive complement of an inherently-reflexive verb can- 
not occur in the Vorfeld, see (i.a) from Müller (1999: 99-100), but if such a reflexive pronoun is 
contained in an argument-marking prepositional phrase, fronting is possible, see (i.b), which 
is parallel to an example from Müller. There is consensus, shared also by Müller (1999: 387), 
that the contrast in (i) is due to a prosodic constraint, namely that unstressable expressions 
cannot be moved to the Vorfeld. These do not only include bare inherently-reflexive pronouns 
but also accusative es ‘it’, see (i.c). 


(i) a. * [NP: Sich] hat Peter geschämt.
         himself has Peter be.ashamed.of
         Intended: ‘Peter was ashamed of himself.’
    b. [PP: Mit sich] schleppt der junge Mann einen Korb
       with himself drags the young man a basket
       ‘The young man is dragging a basket’
       https://filmchecker.wordpress.com/2013/12/13/filmreview-basket-case-1982.
       Accessed 2016-02-11.
    c. * Es haben die Kinder lesen müssen.
         it.ACC have the children read must
         Intended: ‘The children had to read it.’

from Kuno & Takami (2004) and Ward & Birner (2004). Kay & Sag (2014) do
not provide any context, so we can only check on the observation from Ward & 
Birner (2004) that the subject is not the newest element in the sentence. We make 
the plausible assumption that the expletive subject is co-indexed with a post- 
verbal constituent, namely the NP another worker in (21a) and the extraposed 
that-clause in (21b). Consequently, the expletive is at best as discourse-new as 
the post-verbal constituent, which satisfies the constraint. 


4 Analysis 


We will first provide the basic idea of our analysis and then show that it allows 
us to derive the syntactic flexibility of SNDIs in a natural way. 


4.1 A redundancy-based semantic analysis 


The picture that emerged from the discussion in §2 was that the difference in the
syntactic encoding of SNDIs and semantically decomposable idioms is questionable.
We will propose an encoding of SNDIs in terms of individual word-level lexical
entries and, based on the discussion in §3, derive the restrictions on their syntactic
flexibility from the interaction of this encoding with the language-specific
properties of the relevant syntactic constructions. This is also the position taken
in Kay & Sag (2014), which, however, is exclusively based on English data.

There are at least two major challenges for any analysis of idioms in terms of 
individual word-level lexical entries. First, a mechanism is needed to ensure the 
co-occurrence of the idiom's components. We will call this the collocational chal- 
lenge. Second, if the idiom's syntactic components combine according to the con- 
ventional rules of combinatorics, the idiom's semantics should equally emerge 
through the conventional mechanism of combinatorial semantics. We will call 
this the compositional challenge. 

Any approach based on the insights of NSW has presented a solution to the col- 
locational challenge. Within Head-driven Phrase Structure Grammar, for example, 
this is usually done by some sort of extended selectional mechanism (Krenn & 
Erbach 1994; Soehn & Sailer 2003; Sag 2007; Kay & Sag 2014), but more powerful 
collocational systems have also been used (Riehemann 2001; Sailer 2003; Soehn 
2006). Common to all of these approaches is a proliferation of lexical entries. 
The word kick, for example, has lexical entries for its literal and for its idiomatic
meanings. We will share this assumption and not elaborate on the collocational
challenge any further - for such an elaboration, see, for example, the analysis of 
semantically decomposable idioms in Webelhuth et al. (to appear). 

What we will focus on here is the compositional challenge, which has played 
a major role in making the phrasal analysis of SNDIs so attractive. If there is no 
evidence that parts of an SNDI make an individual meaning contribution, why 
not just assign the idiom meaning to the phrase instead of its words? In light of 
the data on the syntactic flexibility of SNDIs, however, such an analysis is not 
easily tenable. 

Kay & Sag (2014) assign the entire meaning of an SNDI to its syntactic head. 
Such a suggestion is very natural within a head-driven syntax. To the other words
within the idiom, Kay & Sag (2014) assign an empty semantic contribution.¹¹
They achieve this by working within Minimal Recursion Semantics (Copestake 
et al. 1995; 2005), where semantic representations are encoded as lists of simple 
predicate-argument expressions and subordination constraints among these. An 
empty semantic contribution is simply encoded as an empty list. 

This analysis is sketched in (22). We distinguish the idiom-internal kick from
its literal homonym by representing the former as kick_id. We proceed analogously
for the other words. The semantic representation of kick_id consists of the
predicate die_id, a situation s, and the index of the subject: x.


(22) Semantic analysis of kick the bucket à la Kay & Sag (2014)
     a. kick_id: ⟨die_id(s, x)⟩
     b. the_id: ⟨⟩
     c. bucket_id: ⟨⟩


Kay & Sag (2014) derive the right semantics for the idiom and thereby solve
the compositional challenge. They also account for the absence of an internal
modification reading, as the noun bucket_id does not make any semantic contribution
that could be modified. The semantic emptiness of bucket_id is also made
responsible for the fact that topicalization is not possible with kick the bucket, as
topicalization requires the topicalized constituent to be non-empty.
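
The empty-list encoding in (22) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that semantic contributions are lists of strings; the names `KAY_SAG_LEXICON`, `phrase_semantics`, and `can_topicalize` are hypothetical and do not come from Kay & Sag (2014) or from any implementation of Minimal Recursion Semantics.

```python
# Hypothetical encoding of (22): each idiom word is mapped to a list of
# predicate-argument expressions. An empty list stands for an empty
# semantic contribution.
KAY_SAG_LEXICON = {
    "kick_id": ["die_id(s, x)"],
    "the_id": [],
    "bucket_id": [],
}


def phrase_semantics(words):
    """Collect the contributions of all words in a phrase (list append)."""
    contributions = []
    for word in words:
        contributions.extend(KAY_SAG_LEXICON[word])
    return contributions


def can_topicalize(constituent_words):
    """Assumed non-emptiness requirement on topicalized constituents."""
    return len(phrase_semantics(constituent_words)) > 0


if __name__ == "__main__":
    # The whole idiom carries exactly the meaning contributed by the head.
    print(phrase_semantics(["kick_id", "the_id", "bucket_id"]))
    # The NP "the bucket" is semantically empty, so fronting it is blocked.
    print(can_topicalize(["the_id", "bucket_id"]))
```

Under these assumptions, the idiom-internal NP fails the non-emptiness check simply because both of its words contribute empty lists, which mirrors the explanation for the ban on topicalization.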

In the light of the examples in (21), Kay & Sag (2014) do not impose a non-emptiness
constraint on passive subjects. Instead, they classify the idiomatic verb
kick_id as belonging to a verb class that does not allow for passivization.


¹¹The earliest reference to such an approach seems to be Ruhl (1975). Unfortunately, we could
not get a copy of this paper. NSW explicitly reject this type of approach as failing to account
for the syntactic fixedness of SNDIs.

While this analysis already goes a long way in what we consider the right direction,
we think that a slightly different answer to the compositional challenge
might get us even further. Instead of empty semantic contributions for the words
bucket_id and the_id, we assume redundant semantic contributions and make use
of Lexical Resource Semantics (LRS, Richter & Sailer 2004). Within this frame- 
work, Richter & Sailer (2001; 2006) argue that the co-occurrence of words that 
contribute the same semantic operator (such as question or negation) is common 
in the languages of the world and, therefore, should be analyzed that way. Sailer 
(2010) extends this argument to lexical semantic contributions in his analysis 
of the English cognate object construction. The semantic contributions of signs 
used in these works are list-based, just as in Kay & Sag (2014). In contrast to Kay 
& Sag (2014), however, the different lists may contain identical elements. An- 
other difference is that the elements on the semantic contribution list need not 
be predicate-argument expressions but can be of any form. 

Our analysis of kick the bucket is sketched in (23), where we indicate the lexical 
semantic contributions of the idiom's words. 


(23) Redundancy-based semantic analysis of kick the bucket:
     a. kick_id: ⟨s, die_id, die_id(s, α), ∃s(β)⟩
     b. the_id: ⟨s, ∃s(β)⟩
     c. bucket_id: ⟨s, die_id, die_id(s, α)⟩


The verb kick_id contributes a situation s, the predicate die_id, and the formula
that combines this predicate with its two arguments, one of them being the
situation s. The second argument of die_id is left underspecified, as its semantics
will come from the subject. This underspecification is indicated with a lower-case
Greek letter, here α, which is used as a meta-variable over expressions of
our semantic representation language. The verb also contributes an existential
quantification over the situational variable: ∃s(β). The meta-variable β indicates
that the scope of the quantifier is underspecified.


In other words, kick_id contributes the same kinds of elements as other verbs.
Similarly, the semantic contribution of the determiner the_id is just like that of a
normal determiner. It contributes a variable and a quantification over this variable.
The noun bucket_id, just like other common nouns, contributes a referential
variable and a predicate.

While the semantic contributions of the idiomatic words in (23) are analogous
to those of non-idiomatic words, it can be seen that the contributions of the_id
and bucket_id are contained in the contribution of kick_id.¹² This is what we refer
to as redundant marking.

When words combine to form a phrase, their meaning contributions are col- 
lected, i.e. the list of semantic contributions of a phrase contains all the elements 
of its daughters’ lists. For the sentence Alex kicked_id the_id bucket_id, the semantic
contribution list will contain all the elements listed in (23) plus the contribution 
of the word Alex, which is just the constant alex. 

At the sentence level, all the elements of this list must be combined into a single
formula. To do this, each meta-variable must be assigned an element from the
contribution list as its value. In our case, α would be assigned alex, which results
in die_id(s, alex). This formula is taken as the value of the meta-variable β. This
leads to the intended semantic representation of the sentence: ∃s(die_id(s, alex)).
The constant die_id occurs only once in this logical form, even though it is contributed
by two words in the sentence: kick_id and bucket_id.
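
The collection and resolution steps just described can be sketched in Python. This is only a toy illustration under strong assumptions: contributions are plain strings, the meta-variables are written `ALPHA` and `BETA`, and the resolution order is hard-wired for this one sentence. The names `LRS_LEXICON`, `collect`, and `resolve` are hypothetical and are not taken from Lexical Resource Semantics itself.

```python
# Hypothetical string encoding of the contribution lists in (23),
# plus the constant contributed by the word "Alex".
LRS_LEXICON = {
    "Alex": ["alex"],
    "kicked_id": ["s", "die_id", "die_id(s, ALPHA)", "EXISTS s(BETA)"],
    "the_id": ["s", "EXISTS s(BETA)"],
    "bucket_id": ["s", "die_id", "die_id(s, ALPHA)"],
}


def collect(words):
    """The phrase's list contains all elements of its daughters' lists."""
    contributions = []
    for word in words:
        contributions.extend(LRS_LEXICON[word])
    return contributions


def resolve(contributions, alpha_value):
    """Combine the contributed elements into one formula.

    Identical contributions from different words count as one element,
    which is the sense in which the marking is redundant.
    """
    distinct = list(dict.fromkeys(contributions))
    for needed in (alpha_value, "die_id(s, ALPHA)", "EXISTS s(BETA)"):
        if needed not in distinct:
            raise ValueError(f"missing contribution: {needed}")
    # Assign ALPHA its value, then use the result as the value of BETA.
    body = "die_id(s, ALPHA)".replace("ALPHA", alpha_value)
    return "EXISTS s(BETA)".replace("BETA", body)


if __name__ == "__main__":
    contribs = collect(["Alex", "kicked_id", "the_id", "bucket_id"])
    print(contribs.count("die_id"))   # contributed by two words
    print(resolve(contribs, "alex"))  # but used once in the logical form
```

The design choice worth noting is the deduplication step: because identical elements from different lists are identified rather than accumulated, redundant contributions do not multiply material in the resulting formula.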

The redundancy-based analysis of kick the bucket will directly carry over to 
other SNDIs, be it in English or in other languages. In our case, the same semantic 
contributions would be assumed for the words in the German idiom den Löffel
abgeben ‘die’.

In the next two subsections, we will look more closely at the syntactic flexibility
of SNDIs. We will show that the attested behavior follows directly from the
interaction of the proposed analysis of SNDIs and the construction-specific constraints
presented in §3. We will also show some advantages of the redundancy-based
approach over the one of Kay & Sag (2014).


4.2 Syntactic flexibility of German SNDIs 


We will go through the three phenomena of German syntax discussed in §3.1 and
look at them in the light of SNDIs. 


4.2.1 German SNDIs and V2-movement 


The restrictions on V2-movement are syntactic in nature and do not at all depend 
on the content of the verb. We hence expect that these constraints hold for the 
verbs in SNDIs. This is borne out. With den Löffel abgeben, for example, which
contains a verb with the separable particle ab, a non-finite verb following the 


¹²Technically, this effect can be achieved through selection. The selecting verb requires its complement
to have the same index and to contribute the same constant: die_id.

Vorfeld is ungrammatical, see (24b), and so is fronting the finite verb together
with the particle, see (25b).¹³


(24) a. Alex hat den Löffel ab-gegeben.
        Alex has the spoon on-passed
     b. * Alex ab-gegeben den Löffel hat.

(25) a. Alex gab den Löffel ab.
        Alex passed the spoon on
        ‘Alex died.’
     b. * Alex ab-gab den Löffel.


4.2.2 German SNDIs and Vorfeld placement 


As we saw in §3.1.2, there are three possibilities for a constituent to be licensed
in the Vorfeld: formal movement, base generation, and Ā-movement for contrast.
Fanselow (2004) provides examples of Vorfeld placement of constituents
of SNDIs. One of his examples is given in (26) (from Fanselow 2004: 22), where 
the PP-constituent of the idiom am Hungertuch nagen ‘be very poor’ (lit.: ‘gnaw 
at the hunger cloth’) is fronted. The sentence has a contrastive interpretation; the 
alternatives are various degrees of poorness. 


(26) Am Hunger-tuch müssen wir noch nicht nagen.
     on.the hunger-cloth must we yet not gnaw
     ‘We are not down on our uppers, yet.’


¹³There are idioms where the verb must be in V2-position. Richter & Sailer (2009: 300) claim that
the idiom in (i) has a fixed Vorfeld element followed by the finite form tritt. We think that this 
is due to the fact that this is an idiom with a “pragmatic point" (Fillmore et al. 1988) and, thus, 
a certain illocutionary force is part of the idiom, which is not compatible with a V-final clause. 


(i) a. Ich glaub, mich tritt ein Pferd!
       I believe me.ACC kicks a horse
       ‘I am very surprised.’ / ‘I can't believe this!’
    b. * Ich glaub, dass mich ein Pferd tritt.

When we apply these considerations to den Löffel abgeben, we see that in an
active sentence, fronting the NP den Löffel should be unproblematic under a contrastive
reading.¹⁴

This is shown in (27), where the alternatives are other consequences of serious 
illness. 


(27) Es sind zwar viele schwer krank geworden, den Löffel hat aber noch
     it are admittedly many heavy sick become the spoon has but still
     niemand ab-gegeben.
     nobody on-passed
     ‘Though many got seriously sick, nobody has died yet.’


These contrastive cases clearly distinguish between our analysis and that of 
Kay & Sag (2014). Since the NP den Löffel contributes the same situational variable
as the verb abgeben, it is easy to know to which larger "toto" the fronted
"pars" belongs. In an analysis with an empty semantics of the NP, this would not 
be possible. 


4.2.3 German SNDIs and the passive 


We expect the passivizability of SNDIs to follow from the interaction between 
the above analysis and the general properties of the German passive discussed
in §3.1. The German passive voice demotes the subject of an active clause. In our
analysis, a passive verb requires that there be a participant filling the thematic
role of the active subject and that this subject have a non-redundant index.¹⁵


¹⁴For the non-contrastive case, we find clause-initial placement of the Löffel-NP in V-final
clauses, at least in the passive. This shows that the idiom-internal NP can be fronted by formal
movement.

(i) Da ist nichts mehr zu machen. ‘Nothing can be done anymore.’
    a. Es sieht so aus, als ob [der Löffel jetzt endgültig ab-gegeben ist].
       it looks so out as if the spoon now definitively on-passed is
       ‘It looks like it is definitely over now.’
    b. Der Löffel ist jetzt endgültig ab-gegeben.
       ‘It is definitely over now.’


¹⁵A bit more technically, the index of the active subject must not be identical with the index of
the active verb or of any of the verb's arguments. This restriction does not seem to be valid for
German only, but can be used to derive the ungrammaticality of *Alex_i was shaved by himself_i.
A reviewer pointed out that a reflexive pronoun is possible in a by-phrase in a context that

There are additional restrictions on verbs that cannot be passivized or only with
the special pragmatic effect mentioned in Footnote 8.

Dobrovol'skij (2000) argues that a VP-idiom, semantically decomposable or 
not, can never be passivized if the literal counterpart of the idiom's verb cannot 
be passivized. His example is the semantically decomposable idiom einen Korb 
bekommen ‘get the brush-off’ (lit.: ‘receive a basket’), which can neither be passivized
in its literal nor in its idiomatic reading.

Idioms with an expletive subject do not passivize either. An example is Bindfäden
regnen ‘rain heavily’ (lit.: ‘rain strings’), see (28).


(28) * Hier werden/wird Bindfäden geregnet. 
here are/is strings rained 


This is expected under our analysis. The LRS analysis of expletives is redun- 
dancy-based. For weather verbs, Levine et al. (2014) assume that the expletive 
subject has the same index as the verb. Consequently, the sentence in (28) violates 
the constraint that the demoted subject must not have a redundant index. 

A reviewer brought the example in (29a) to our attention. Müller (2002: 131) 
points out that if (29b) is the active counterpart of (29a), one is forced to allow 
the weather-es to be the underlying subject of a passive. This might undermine 
the explanation for blocking (28). 


(29) a. Die Stühle wurden nass geregnet.
        the chairs were wet rained
        ‘The rain caused the chairs to become wet.’
     b. Es hat die Stühle nass geregnet.
        it has the chairs wet rained


Our semantic-based constraint on passivization does not run into this problem. 
We give a very rough sketch of the logical form of (29) in (30). This formula can 
be paraphrased as in the following sentence. There are the eventualities s, s′, and
s″, such that s is a raining event, s′ is a state with wet chairs, and s″ is a causation
event in which the raining s causes the wetness s′.


(30) ∃s ∃s′ ∃s″ (rain(s) ∧ wet(s′, the-chairs) ∧ cause(s″, s, s′))


evokes alternatives to the reflexive pronoun, such as Chris was shaved by Alex and Alex was 
shaved by himself. This exception is clearly connected to a special semantics to which our
non-redundant index requirement would need to be adapted.

Following the syntactic analysis in Müller (2002: 241), the resultative version
of regnen comes about by a lexical rule that changes the verb's valence require- 
ment and adds the semantic material required for the causation/result semantics. 
When one adapts this rule to LRS, it also changes the index of the verb from the 
raining event to the causation event. Consequently, resultative regnen in (29) has 
the index s″ in (30), whereas the raining (and, by redundancy, the expletive es)
has the index s. Since the underlying active subject and the passivized verb have 
distinct indices under this analysis, the grammaticality of (29a) is predicted. Note 
that this analysis, again, is possible under a redundancy analysis of expletives but 
hard to implement if one assumes an empty semantics for expletives. 
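
The non-redundant index requirement on the demoted subject can be made concrete with a small Python sketch. The class `ActiveVerb`, the function `can_passivize`, and all index labels below are hypothetical simplifications introduced for illustration; the sketch encodes only the index condition discussed in this section and ignores the additional restrictions on unaccusative verbs and on verbs like bekommen.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ActiveVerb:
    """Toy lexical entry: a verb form with the indices relevant here."""
    form: str
    index: str                      # index (eventuality variable) of the verb
    subject_index: Optional[str]    # index of the active subject, None if absent
    other_argument_indices: Tuple[str, ...] = ()


def can_passivize(verb: ActiveVerb) -> bool:
    """Assumed constraint: the demoted subject needs a non-redundant index."""
    if verb.subject_index is None:
        return False  # no subject to demote, as with grauen in (16)
    blocked = {verb.index, *verb.other_argument_indices}
    return verb.subject_index not in blocked


# Weather verb: the expletive shares the raining index s with the verb.
WEATHER_REGNEN = ActiveVerb("regnen", index="s", subject_index="s")
# Resultative regnen: the verb's index is the causation event, written s2
# here for s'', while the expletive keeps the raining index s.
RESULTATIVE_REGNEN = ActiveVerb(
    "nass regnen", index="s2", subject_index="s",
    other_argument_indices=("chairs",),
)
# Ordinary transitive verb with a referential subject, as in (13).
OEFFNEN = ActiveVerb(
    "öffnen", index="e", subject_index="karl",
    other_argument_indices=("window",),
)
# Subjectless verb, as in (16).
GRAUEN = ActiveVerb(
    "grauen", index="e", subject_index=None,
    other_argument_indices=("student",),
)

if __name__ == "__main__":
    for verb in (WEATHER_REGNEN, RESULTATIVE_REGNEN, OEFFNEN, GRAUEN):
        print(verb.form, can_passivize(verb))
```

With these toy entries, the weather verb is blocked because its subject index coincides with the verb's index, whereas the resultative variant passes because the lexical rule is assumed to have shifted the verb's index to the causation event.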

As for verbs allowing for passivization, Dobrovol'skij (2000: 561) distinguishes 
between idioms with idiom-external accusative objects, as in (31), and those with 
idiom-internal accusatives, as in his example in (32). For the former, there is no 
idiom-specific restriction on passivization. 


(31) etwas auf Eis legen ‘put something on hold’
     Das Projekt wurde auf Eis gelegt.
     the project was on ice put
     ‘The project was put on hold.’

(32) jemandem den Garaus machen ‘kill someone’
     ... den lästigen Hausgenossen soll nun ... der Garaus
     the.DAT annoying housemates should now the.NOM Garaus
     gemacht werden ...
     made be
     ‘... the annoying housemates should now be killed ...’


Dobrovol’skij (2000) assumes that the main function of the German passive is 
to promote an accusative complement. This promotion has the syntactic effect 
of realizing the underlying accusative complement as a subject and the seman- 
tic/pragmatic effect of assigning its referent the status of a topic. Based on these 
assumptions, he diagnoses a syntax-semantics mismatch in sentences like (32). 
Syntactically, he says, the idiom-internal NP is promoted, but semantically it is 
the idiom-external dative NP. In a subject-demotion approach, no such mismatch 
needs to be assumed for (32). We can derive the topicality of the dative NP from 
the fact that it occurs in a topic position - here, its appearance in the Vorfeld 
through formal movement (see §3.1.2). 

Dobrovol'skij (2000) only considers passives of transitive verbs with an agen- 
tive meaning. Our approach does not have this limitation. We expect the passive 
to be possible with idioms having a non-agentive idiomatic meaning, such as den 
Löffel abgeben, for which we can indeed find examples, see (33). 


(33) Bei den Grünen wird der politische Löffel schon vor 
at the Green.party is the political spoon already before 
Amtsabschied ab-gegeben. 
resigning on-passed 
‘In the Green Party, people die politically already before resigning from 
their office.’


In this section, we argued that the restrictions on three syntactic processes of 
German (V2-movement, fronting, and passivization) are very weak and compat- 
ible with the syntactic, semantic, and pragmatic properties of an SNDI such as 
den Löffel abgeben. We therefore expect that the idiom can occur in all of them. 


4.3 Syntactic flexibility of English SNDIs 


We saw in §3.2 that English imposes semantic constraints on frontable constituents 
and on passive subjects. We will now explore the interaction of these constraints 
with our lexical encoding of SNDIs. 

For topicalization, we saw in §3.2 that the topicalized constituent must be explicitly 
linked to the previous discourse, and that it must make an independent 
semantic contribution within its clause. In LRS, such a non-redundancy require- 
ment can be expressed easily by saying that the semantic contribution of the 
topicalized constituent must not be properly included in the semantic contribu- 
tion of the rest of the clause. In our analysis, the meaning of the NP the bucket 
is fully included in the meaning of the rest of the clause. Therefore, the ban on 
topicalization follows directly. 
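
The non-redundancy requirement can be illustrated with a minimal sketch. It assumes, purely for illustration, that a semantic contribution can be modelled as a set of meaning components. The component labels, the function name, and the contrasting decomposable idiom spill the beans are assumptions made for this sketch and are not the LRS representations used in this chapter.

```python
def makes_independent_contribution(constituent, rest_of_clause):
    """Return True if the constituent adds at least one meaning component
    that the rest of the clause does not already contribute."""
    return not set(constituent).issubset(set(rest_of_clause))


# Hypothetical meaning components (illustrative labels only).
# Redundancy analysis of "kick the bucket": the NP repeats material
# that the idiomatic verb already contributes.
bucket_np = {"die"}
kick_rest = {"die", "past"}

# Assumed contrast case, a decomposable idiom such as "spill the beans":
# the NP contributes a component of its own.
beans_np = {"secret"}
spill_rest = {"reveal", "past"}

print(makes_independent_contribution(bucket_np, kick_rest))   # False
print(makes_independent_contribution(beans_np, spill_rest))   # True
```

Under these assumptions, a topicalization constraint would simply require the function to return True for the fronted constituent, which it does not for the bucket.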

Matters are slightly more complicated when we look at the passive voice. The 
constraints on a passive subject have been shown to be weaker than those on a 
topicalized constituent. We saw above that a passive subject must refer to some- 
thing that has been mentioned earlier in the discourse (or that can be inferred 
from such an element). This does not exclude the possibility of the subject mak- 
ing a semantic contribution that is contained in that of the rest of the sentence - 
as we saw in the cases of expletive passive subjects in (21). 


'http://www.kontextwochenzeitung.de/politik/148/erst-schreien-wenn-etwas-geschafft-ist- 
1992.html. Accessed 2014-12-19. 
Consequently, if the discourse conditions on passive subjects are met, even 
English SNDIs can be passivized. In (7), repeated in (34), kick the bucket is topical, 
only the tense and the result state are new. 


(34) When you are dead, you don't have to worry about death anymore. ... The 
bucket will be kicked. 


The example in (34) is one of admittedly few naturally occurring examples 
of the passive with this idiom. The following examples show passives for other 
idioms that are classified as IPs in NSW, see (35), or do not pass the tests for 
semantic decomposability, see (36). Example (36) shows particularly clearly that 
the meaning of the idiom have a cow is discourse-old, as it is explicitly mentioned 
in the preceding clause. 


(35) saw logs ‘snore’ 
I excitedly yet partially delusional turned to Alexandria to point out the sun 
as it set and all I see is eyelids and hear logs being sawed. Come on! I can't 
say too much because I wasn't far behind as I was catching flies [= sleeping] 
about a minute later. 


(36) have a cow ‘get angry’ 
There was really no need for the police to have a cow, but a cow was had, 
resulting in kettling, CS gas and 182 arrests. 


An approach that assumes an empty semantics for the idiom-internal NP the 
bucket runs into severe problems. We saw above that passivization is possible 
for SNDIs if the strong discourse requirements are met. Thus, it would be wrong 
to categorically block the passivization of kick;;. Our approach correctly pre- 
dicts the admittedly rare occurrence of passives with this idiom. Furthermore, 
an empty semantics for the bucket does not allow us to relate the NP's mean- 
ing to the preceding discourse. A redundancy-based account makes the required 
semantic information available at the clause-initial constituent. 


In a recent talk, Christiane Fellbaum presented two other naturally occurring examples of kick- 
the-bucket passives and passives of other English idioms that express the idea of “dying”. In 
as far as context is included in her examples, they also satisfy the topicality requirement. See: 
http://www.crissp.be/wp-content/uploads/2015/04/Talk7-Fellbaum.pdf. Accessed 2015-08-27. 

Note that even though the examples in (35) and (36) may have a playful character, they do not 
blend the idiomatic and the non-idiomatic reading, as it would typically be the case in jokes 
or puns. 

http://5050experience.sportsblog.com/posts/1125677/feast.html. Accessed 2015-07-24. 

http://www.theguardian.com/commentisfree/2012/aug/01/cyclists-like-pedestrians-must-get-angry. Accessed 2015-08-24. 
Let us conclude §4 with a brief summary of our analysis. We replaced NSW’s 
causal relation between the semantic decomposability and the syntactic flexibil- 
ity of idioms with an approach based on the interaction of the properties of id- 
ioms with the constraints on syntactic constructions. While, overall, our account 
is very similar to Kay & Sag (2014), an important difference is that we make use 
of redundant marking, a choice which we hope to have motivated above. 


5 Extension to other languages 


So far, we have only looked at English and German. These two closely-related 
languages already show considerable differences in their syntactic constructions, 
and these differences have far-reaching consequences for the flexibility of MWEs. 
In this section, we would like to briefly show that other languages have yet other 
constraints on similar syntactic operations and that these have a predictable ef- 
fect on the flexibility of idioms. 


5.1 Estonian 


Muischnek & Kaalep (2010) name and describe a number of problems in apply- 
ing an English-based classification of idioms to Estonian. Similar to German, Es- 
tonian allows for considerably more word-order flexibility than English. Muis- 
chnek & Kaalep (2010: 122) argue that Estonian has a passive-like construction 
whose function is to background a (usually human) subject, rather than to fore- 
ground an object. This is similar to the function of the passive in German. Con- 
sequently, passivizing intransitive verbs is possible, see (37). 


(37) Mees jookseb → Joostakse 
man run.PRESENT run.IMPERS 
‘The man is running.’ ‘Somebody is running.’


In order to emphasize its subject-backgrounding function, this construction 
is called impersonal passive. In contrast to German, there is no change in the 
morphological case of the active direct object, see (38). This leads us to expect 
that the lack of object foregrounding might be even stronger in Estonian than in German. 


The differences between German passives and Estonian impersonal passives are discussed in 
detail in Blevins (2003). 
(38) Mees loeb raamatut. → Loetakse raamatut. 
man read.PRESENT book.PART read.IMPERS book.PART 
‘The man is reading a book.’ ‘A book is being read.’; ‘Somebody is reading a book.’


Muischnek & Kaalep (2010) state that the impersonal passive can be formed 
with all idioms, including SNDIs. The only condition is that the active subject be 
human. Kadri Muischnek (personal communication) kindly provided us with the 
example in (39). 


(39) Kas massiliselt heideti hinge? 
Q massively threw.IMPERS soul.PART 


‘Did they die massively?’


5.2 French 


In French, we see yet a different pattern. Abeillé (1995) lists French idioms that do 
not permit internal modification but do permit the passive voice, such as faire un 
carton ‘hit the bull’ (lit.: ‘make a box’). These reported data suggest that French is 
more like German than like English when it comes to the passive. Lamiroy (1993) 
provides convincing arguments that this is indeed the case. Instead of promoting 
a non-subject argument, the French passive also primarily demotes a subject. 
French allows for the passivization of strictly intransitive verbs, see (40a) from 
Lamiroy (1993: 54), but not as productively as German, see (40b). 


(40) a. Il a été dormi dans mon lit. 
it.EXPLETIVE has been slept in my bed 


‘Someone had been sleeping in my bed.’


b. Ils courent. → * Il est fréquemment couru ici. 
they run it is often run here 
‘They are running.’ ‘There is often someone running here.’


We will leave the details of the passivizability of intransitive verbs in French 
aside. Gaatone (1993) gives examples of passivized French SNDIs, including the 
one in (41) (see Gaatone 1993: 47). 


From the etTenTen corpus: http://www.keeleveeb.ee. 

The English counterpart wear the pants syntactically behaves like kick the bucket. The corresponding 
German expression die Hosen an-haben (lit.: ‘have the pants on’) cannot be passivized 
since the verb haben ‘have’ is unpassivizable in general. 
(41) porter la culotte ‘wear the pants’ 
Mme et M. Armand y régnent paternellement, bien que la 
Mrs and Mr Armand there rule paternally even though the 
culotte y soit portée par madame ... 
pants there is worn by madam 


‘Mrs and Mr Armand rule there paternally even though she is the 
dominant part.’


In this section, we showed that our results of the German-English contrast 
carry over to other languages as well. Whether or not an SNDI can appear in a 
certain syntactic construction is dependent on the constraints on that construc- 
tion in the particular language. Languages may differ significantly with regard 
to these constraints. For this reason, classical tests for classifying idioms, such as 
passivizability and fronting, cannot be easily applied across languages but need 
to be re-examined in each individual case. 


6 Conclusion 


Wasow et al. (1983) and Nunberg et al. (1994) have led to a shift in perspective 
from a monolithic, fully phrasal view of all idioms to a more lexical approach for 
semantically decomposable idioms. We agree with Kay & Sag (2014) in extending 
this lexical approach to SNDIs. In order to provide a solid motivation for this 
step, it is essential to look at a larger set of languages, in particular languages that 
differ in the semantic and pragmatic properties of morphosyntactically similar 
constructions. The present paper made a first step in that direction and looked at 
verb fronting, topicalization, and passivization in German and English as well as 
the impersonal passive in Estonian and the passive in French. Whereas Nunberg 
et al. (1994) are forced to analyze English and German SNDIs in considerably 
different ways, the lexical analysis presented here provides a cross-linguistically 
uniform analysis. 

This type of analysis has consequences for the encoding of multiword expressions 
(MWEs) in formal grammar in general. All MWEs that are of syntactically 
regular shape should receive a lexical encoding. The difference between seman- 
tically decomposable and semantically non-decomposable MWEs lies in the way 


Parallel treatments of SNDIs and semantically decomposable idioms have recently been pro- 
posed within other frameworks as well; see a short remark in Harley & Stone (2013: fn.2) 
within a Minimalist approach and Lichte & Kallmeyer (2016) for Tree Adjoining Grammar. 

We side with Müller (2013b: 923), who states: “If we can choose between several theoretical 
approaches, ... we should take the one that can capture cross-linguistic generalizations.” 
in which the semantics of the MWE is distributed over the words constituting the 
MWE. Whereas the parts of a semantically decomposable MWE have an independent, 
i.e. non-redundant, meaning, the parts of a semantically non-decomposable 
MWE do not. Differences in the syntactic flexibility of semantically decompos- 
able and semantically non-decomposable MWEs follow exclusively from the in- 
teraction between the language-specific constraints on a syntactic operation and 
the semantics of the MWE's constituents. 
Named entities (NEs) constitute a great challenge for computational linguistics and 
have been one of its major research topics during the last decade. They can be divided 
into categories describing people, locations, time, organizations, and others. In this paper, 
we restrict our discussion to proper names that belong to three main classes: 
personal, location and organization names, and that can be either single-word 
nouns or multiword expressions. First, we define common (language-independent) 
semantic patterns for proper names, and then we present the corresponding 
syntactic patterns in English, Bulgarian, French, Greek, and Serbian. 
We compare these patterns with regard to the grammatical categories of dependent 
constituents, definiteness, distribution of clitics, word order, and various alternations. 
Our ultimate goal is to build a universal framework for Named Entity Recognition (NER). 


1 Introduction 


Proper names are usually defined as belonging to the following main classes: per- 
sonal names, location names, and organization names, called also named entities 
(NEs). They can be single-word nouns or particular types of multiword expres- 
sion (MWE). 

The aim of this paper is to offer a common template for description and classi- 
fication of proper names in different languages. Our objectives are: i) to formu- 
late semantic patterns for personal, location and organization names that capture 
the general semantics and should be, to a great extent, language-neutral; ii) to de- 
scribe language-specific syntactic patterns corresponding to a common semantic 
pattern. The syntactic patterns provide information about the grammatical class 
of the head and constituents; dependencies among the constituents; word order 
and contiguity; cliticisation (if applicable - for possessive pronoun and interrog- 
ative clitics). 

This study is based on evidence gathered from five languages - English, Bul- 
garian, French, Greek, and Serbian - belonging to four different language groups 
(Germanic, Hellenic, Romance, and Slavic). The utility of language-neutral se- 
mantic patterns lies in the fact that they can be applied to new languages, thus 
paving the way towards more universal solutions for (rule-based) named entity 
recognition (NER). The set of language-specific syntactic patterns displays corre- 
spondences between morphological and syntactic language-specific characteris- 
tics, and they may serve as transformation rules in rule-based machine transla- 
tion, cross-lingual information extraction and summarization. 


2 Names: a general overview 


In English (Huddleston 1988: 96), two different terms are often used: proper noun 
- referring to the part-of-speech of the word and comprising only single-word 
proper names, e.g., John, London, Adidas, and proper name - referring to the func- 
tion of these words as referential elements and comprising single- and multiword 
proper names, as in John, John Smith Junior, London, the United States of America, 
Nike, Microsoft Corporation. Following this distinction, proper names can be further 
specified as: proper nouns (Anna, Asia, Google), multiword expressions (Jean-Pierre 
Deckles, New York, the United Nations), and noun phrases (Professor Deckles, 
New York City, the United Nations Organization). Proper names - expressed either 
by proper nouns or by MWEs - show common semantic and syntactic behavior 
and we describe them in a uniform way. 

Proper names do not “[d]escribe or specify characteristics of objects” but are 
“logically connected with characteristics of the object to which they refer” (Searle 
1958: 173). For example, Saint Petersburg may refer to the second largest city in 
Russia (They convened in Saint Petersburg), a city in Florida, a city in Pennsylva- 
nia, the fictional hometown of Tom Sawyer and Huckleberry Finn, St. Petersburg, 
Missouri, but also to a college in Florida - Saint Petersburg College. A particular 
common noun (e.g., city, street, president, actor) specifies the object whose 
instances may be represented by a set of proper names; and such an object is 
always presupposed for a given proper name even if the common noun is not 
mentioned explicitly in the text. Furthermore, multiword names may comprise 
a category word (square in Trafalgar Square; ocean in the Indian Ocean), and in 
these cases, the category of the particular name is always explicitly shown (Car- 
roll 1985: 144). 

The relation between a proper name and its category object is reflected in 
WordNet where the relation between a concept and its instances is defined as 
an instance hypernym (instance hyponym) relation (Rodríguez et al. 1998). The instances 
(proper nouns) inherit characteristics from the concepts of the hierarchy 
to which they belong. For example, the name Saint Petersburg is an instance of 
a city, and the concept {city, metropolis, urban centre} links to the more general 
concept {region}, which, in its turn, links to an even more general concept 
{location}. 
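
The inheritance chain just described can be sketched in a few lines. This is a hand-built toy hierarchy that contains only the concepts named in the example above. It does not query WordNet itself, and the dictionary and function names are hypothetical.

```python
# Toy data: instance-of and hypernym links taken from the example in the text.
INSTANCE_OF = {"Saint Petersburg": "city"}
HYPERNYM = {"city": "region", "region": "location"}


def concept_chain(name):
    """Return the concepts a name inherits from, most specific first."""
    chain = []
    concept = INSTANCE_OF.get(name)
    while concept is not None:
        chain.append(concept)
        concept = HYPERNYM.get(concept)
    return chain


print(concept_chain("Saint Petersburg"))  # ['city', 'region', 'location']
```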

Different terms are used for common nouns that categorise the referents of 
proper names as members of different classes: descriptors, designators, category 
words (Carroll 1985), external evidence (McDonald 1996), triggers (Magnini et al. 
2002), trigger words. To avoid confusion with the theory of reference, we will 
use the term trigger. 

Triggers depend semantically on the referent of the personal name, and dif- 
ferent names select different classes of triggers. In turn, triggers determine the 
characteristics of the object to which the name refers. For example, if we know 
that the word Washington is a family name, it can select the word president or 
the word actor. Further, the word president and the word actor are similar in the 
way they designate the concept for a person, and this determines the fact that 
both nouns can co-occur with adjectives denoting height, age, etc. The meaning 
of both words also implies that they may be specified by employing expressions 
for affiliation as complements (the President of the USA, the actor at the Muppet 
Theatre). However, not all words that are compatible with the first noun are com- 
patible with the second (stage actor vs. * stage president). Therefore, the notion of 
triggers is central for the classification of the semantic patterns of proper names 
and accordingly - for the description of the respective syntactic patterns. 


3 Grammatical features of names in Bulgarian, English, 
French, Greek and Serbian: a brief overview 


Personal names are singular and inherently definite (the same applies to loca- 
tion and organization names that cannot express definiteness). Some Bulgarian, 
English, French, Greek, and Serbian location and organization names are in sin- 
gularia tantum or pluralia tantum, or marked for definiteness: with a definite 
article (English, French, Greek), with a definite article attached to the noun trig- 
ger with no pre-nominal modifiers or to the leftmost modifier in Bulgarian, and 
only with the definite form of adjectives in Serbian. 

Bulgarian, French, Greek, and Serbian personal, location (apart from cities in 
French which usually do not express gender) and organization names are marked 
for grammatical noun gender — masculine or feminine, in contrast to the English 
ones. Location and organization names in Bulgarian, Greek and Serbian can be 
marked for neuter, as well. In Greek and Serbian, proper names have the nomina- 
tive, accusative, genitive and vocative case. In Serbian, names can also be declined 
in the dative, locative and instrumental, while in Bulgarian vocative is observed 
only with some forenames. 

Syntactically, proper nouns are heads of noun phrases but show restricted 
combinatorial properties compared to common nouns. For the five languages 
discussed in this paper, the forenames can be extended with one or more (rarely 
more than two) proper nouns: a nickname, a patronym and/or a family name. 
Agreement in gender and number is observed if they are of Slavic and Greek ori- 
gin. For feminine surnames of Slavic origin in Serbian, the agreement in gender 
is allowed but not obligatory. Bulgarian, French, Greek and Serbian adjectives 
and Bulgarian, French and Serbian possessive pronouns change to agree in gen- 
der and number with the nouns they modify. Greek and Serbian adjectives and 
possessive pronouns agree in case with the head noun. 

Compared to personal names, location names have a more diverse structure, 
while organization names show the highest complexity. Both location and orga- 
nization names can be proper nouns or proper names, comprising proper and 
common nouns or noun phrases, which begin to function as names of geograph- 
ical locations and organizations, respectively. 


4 Names and multiword expressions 


Many names are composed of more than one word and are classified as multiword 
names. They can comprise two or more proper nouns (Ray Jackendoff, Merrill 
Lynch); common and proper nouns (Bulgarian: Republika Bálgariya ‘Republic 
Bulgaria’); adjectives and a proper or a common noun (International Monetary 
Fund, Upper Manhattan); abbreviations (Financial Advisors Ltd., John Smith Jr., 
Miami, FL); numerals or numbers (the Second Generative Grammar Conference; 
XX Generative Linguistics Conference); verbs and adverbs with names of prod- 
ucts such as books, movies, songs (Someone to Watch over Me; Killing Me Softly), 
etc. 

Anderson (2007) provides a detailed classification of proper names, a subset of 
which is relevant for our study, as follows: simple opaque names (John); simple 
names that have a resemblance to a common word (Prudence); names based on 
other names (Lincoln - for a boulevard); names overtly derived from other names 
(Slavic family names); names based on compounds, some of them containing a 
name (Queensland, Newtown); names based on longer phrases - they may include 
another name (the University of Queensland) or not (Long Island, Hen and Chicken 
Island); and names based on sentences (as with titles of movies). 

An important feature that systematically distinguishes location and organization 
names from personal names is that the TRIGGERS may be their integral part, 
constituting a MWE (Bulgarian: Cerno MORE ‘Black SEA’; English: PRESIDENT 
Roosevelt Boulevard, First Investment BANK; French: BANQUE de France ‘BANK of 
France’; Greek: Τράπεζα της Ελλάδος ‘BANK of Greece’; Serbian: Jadransko MORE 
‘Adriatic SEA’, Međunarodni SUD pravde ‘International COURT of Justice’). 

Carroll (1985) describes the non-classifying part of the location name as a 
name-stem (e.g., Trafalgar in Trafalgar Square) and explores rules according to 
which the name-stem can be used to stand for the whole name. Not only for some 
location names, but also for some organization names with internal triggers, the 
name-stem can replace the whole name, e.g., French: LA MAISON D’EDITION Hachette 
‘the Hachette PUBLISHING COMPANY’ or just Hachette, Greek: η Ολυμπιακή 
‘the Olympic’ for Olympic AIRWAYS. 


The translations in the paper are closer to literal translations than to proper ones, e.g., Bulgarian: 
Republika Bálgariya ‘Republic Bulgaria’ instead of ‘Bulgarian Republic’ or ‘The Republic 
of Bulgaria’, but Greek: ο Έλληνας Πρωθυπουργός ‘the Greek PRIME MINISTER’ (instead of 
‘premier’, for example). 


Further, a location name may feature a personal name specified by a PERSONAL 
TRIGGER (SAN Jorge River) that cannot be omitted without loss of the name function; 
similarly, an organization name may feature a personal name specified by 
a PERSONAL TRIGGER (SAN Jose State University) or a location name specified by 
a LOCATION TRIGGER (Los Angeles City College). 


For the purposes of our study, we differentiate the names on the basis of their 
structure: i) whether the name is a MWE or a noun; ii) whether the multiword 
name obligatorily incorporates a trigger (an internal trigger); and iii) whether 
the name (either single or a MWE) is optionally specified by a trigger (an external 
trigger). The external triggers may be explicit or implicit, depending on the 
context (the City of New York, New York City vs. New York): 
• Single-word personal name (Arthur). 

• Multiword personal name (Arthur Conan Doyle). 

• Multiword personal name, which incorporates an internal personal trigger. 
When people are famous, combinations with triggers such as holy, aristocratic 
and religious titles can be widely used and are stable (POPE John Paul II). 

• Single-word personal name; it is specified by an external personal trigger 
(UNCLE John). Kinship terms are usually combined with a single-word personal 
name. 

• Multiword personal name; it is specified by an external personal trigger 
(PROFESSOR Steven Pinker). 

• Single-word location name (it can coincide or not with a personal name) 
(Danube, Washington). 

• Multiword location name (it may - partially - coincide or not with a personal 
name) (Little Rock, SAN Antonio). 

• Multiword location name, comprising an internal location trigger (Rocky 
MOUNTAINS). No additional location trigger of the same type can be added; 
being part of the name, the trigger cannot be omitted either. A multiword 
location name may include a personal name (and, rarely, an organization 
name) (Cristina FORT). 

• Single-word location name; it is specified by an external location trigger 
(RIVER Nile). 

• Multiword location name; it is specified by an external location trigger 
(VOLCANO Klyucevskaya Sopka). 

• Single-word organization name (it may coincide or not with a personal 
name or a location name) (Matalan, Poundland). 

• Multiword organization name (it may (partially) coincide or not with a 
personal or a location name) (Mercedes Benz). 

• Multiword organization name, comprising an internal organization trigger 
as an integral part of the proper name. Another organization trigger of 
the same type cannot be added. The trigger, which is part of the name, 
cannot be omitted either. The multiword organization name may include 
a personal or a location name (Princess Basma Youth Resource CENTER, 
Melbourne Grammar SCHOOL). 

• Single-word organization name; it is specified by an external organization 
trigger (SUPERMARKET Galaxy). 

• Multiword organization name; it is specified by an external organization 
trigger (the COMPANY Business Models Inc.). 
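
The three structural criteria above (single-word vs. multiword, internal trigger, external trigger) could be recorded in a small data structure. The following is a minimal sketch under that assumption. The class name, field names, and label wording are hypothetical and only illustrate how the name types in the list are distinguished.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NameStructure:
    tokens: Tuple[str, ...]                  # words of the name, internal trigger included
    name_class: str                          # "personal", "location" or "organization"
    internal_trigger: Optional[str] = None   # trigger that is part of the name
    external_trigger: Optional[str] = None   # optional trigger outside the name

    def describe(self):
        """Build a label that mirrors the name types listed in the text."""
        size = "Single-word" if len(self.tokens) == 1 else "Multiword"
        label = f"{size} {self.name_class} name"
        if self.internal_trigger:
            label += f", internal trigger {self.internal_trigger}"
        if self.external_trigger:
            label += f", external trigger {self.external_trigger}"
        return label


examples = [
    NameStructure(("Arthur",), "personal"),
    NameStructure(("Rocky", "Mountains"), "location", internal_trigger="Mountains"),
    NameStructure(("Nile",), "location", external_trigger="River"),
]
for name in examples:
    print(name.describe())
```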


5 Semantic patterns for persons, locations and organizations 


Names can be grouped into different semantic classes and subclasses with re- 
spect to the properties of their referents (explicated by triggers). A name from a 
given class (personal, location or organization) selects triggers from a particular 
set of semantic subclasses. For example, complex personal names are combined 
with triggers that define a legislative job title, executive job title, judicial position, 
academic position, academic title, military rank, and profession. The permissible 
combinations between types of names (proper nouns, MWEs), and semantic sub- 
classes of triggers determine the semantic patterns applicable to the personal, lo- 
cation and organization names. The semantic patterns we propose show semantic 
compatibility valid for a particular semantic class and describe the permissible 
combinatory options. For example, a personal name can be extended with a kinship 
term (e.g., the beautiful STEP-DAUGHTER of John from Paris, Anne Nicole) and 
the kinship term can be specified in various ways and restricted for possessor and 
location; thus, the respective semantic pattern is: (modifier: referent specification 
phrase) - trigger: kinship term - (complement: possessor phrase) - (complement: 
location phrase) - personal name. 
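
A pattern of this kind, with parenthesized elements read as optional, can be sketched as an ordered list of slots. The code below is an illustration under that reading only. The constant and function names are hypothetical, and the matcher assumes that each slot label occurs at most once in a pattern.

```python
# Each slot is (label, optional); the labels follow the kinship pattern in the text.
KINSHIP_PATTERN = [
    ("modifier: referent specification phrase", True),
    ("trigger: kinship term", False),
    ("complement: possessor phrase", True),
    ("complement: location phrase", True),
    ("personal name", False),
]


def fits_pattern(pattern, labels):
    """Check that `labels` fills the slots in order, skipping only optional ones."""
    i = 0
    for slot, optional in pattern:
        if labels[i:i + 1] == [slot]:
            i += 1            # the slot is realized by the next label
        elif not optional:
            return False      # an obligatory slot is missing or out of order
    return i == len(labels)   # no unmatched labels may remain


# "the beautiful STEP-DAUGHTER of John from Paris, Anne Nicole" fills every slot.
full = [label for label, _ in KINSHIP_PATTERN]
print(fits_pattern(KINSHIP_PATTERN, full))                                        # True
print(fits_pattern(KINSHIP_PATTERN, ["trigger: kinship term", "personal name"]))  # True
print(fits_pattern(KINSHIP_PATTERN, ["personal name"]))                           # False
```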

As triggers refer to concepts, the semantic relations in which they are involved 
should be universal and must hold among the relevant concepts in any language. 
Thus, the semantic patterns describe language-neutral relations and can be re- 
garded as universal structures with correlating language-specific syntactic pat- 
terns. 

Following the detailed hierarchy employed by Giuliano (2009) for automatic 
classification of personal NEs, we can conclude that every common noun that 
determines the referent of a personal name can be a trigger, i.e., words such 
as chess-player, singer, footballer, etc. Magnini et al. (2002) use the WordNet hierarchy 
for identification of large sets of triggers - hyponyms of high-level synsets 
such as {person}, {location}, {organization}. Some authors suggest verb triggers 
appearing in the local context of NEs (Zhang et al. 2004), e.g., for water bodies 
(like rivers) the verb flooded in The Sava flooded the village indicates that Sava 
is a river and not a person. There are detailed classifications of NEs (of more 
than 200 categories; cf. Sekine & Nobata 2004), while other classifications build 
shallow hierarchies with the major classes on the top and sets of subtypes with 
different granularity at the low levels (ACE 2018; Fleischman & Hovy 2002). 
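
The idea of verb triggers mentioned above can be illustrated with a deliberately small sketch. It assumes a hand-written mapping from verbs to the class they signal and a whitespace tokenizer. Both are simplifications made for this illustration and are not the method of Zhang et al. (2004).

```python
# Hypothetical mapping from verbs to the name class they signal for their subject.
VERB_TRIGGERS = {"flooded": "location"}


def class_from_verb(sentence, name):
    """Return the class signalled by the word right after `name`, or None."""
    tokens = sentence.rstrip(".").split()
    if name not in tokens:
        return None
    idx = tokens.index(name)
    following = tokens[idx + 1:idx + 2]   # empty if the name ends the sentence
    return VERB_TRIGGERS.get(following[0]) if following else None


print(class_from_verb("The Sava flooded the village.", "Sava"))  # location
```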

In our study, we distinguish the following semantic subclasses for person, lo- 
cation and organization names and their triggers: 


• Persons and personal triggers - legislative job title: prime minister; executive 
job title: executive officer; judicial position: judge; academic position: 
associate professor; military rank: major general; profession: engineer; academic 
title: Ph.D.; true honorific: Mister / Mr; aristocratic title: Prince; religious 
title: Bishop; kinship term: sister; holy title: Saint. 


• Locations and location triggers - natural: river; public: monument; commercial: 
restaurant; infrastructure: boulevard. 


• Organizations and organization triggers - business: company; political: political 
party; government: ministry; media: publishing house; human / non-government: 
association. 
We classify proper names (persons, locations and organizations) in the patterns 
(A) to (I) below according to their shared features. Patterns are described 
in terms of the categories (a)-(d): (a) the semantic subclass of the trigger; (b) the 
type of the proper name that selects triggers; (c) obligatoriness / optionality of 
the trigger manifested by an internal or external trigger with respect to the name; 
(d) the semantic pattern that the proper name evokes. 
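
The four descriptive categories (a)-(d) can be thought of as the fields of a record. The sketch below is a hypothetical encoding, not the authors' formalism: the class name `SemanticPattern`, its field names and the `render` method are assumptions made for illustration, and the values are those the text gives for Pattern C (Section 5.3).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SemanticPattern:
    label: str
    trigger_classes: List[str]      # (a) semantic subclass(es) of the trigger
    name_type: str                  # (b) type of proper name selecting the trigger
    trigger_position: str           # (c) "external" or "internal" trigger
    slots: List[Tuple[str, bool]]   # (d) ordered slots as (name, is_optional)

    def render(self) -> str:
        """Write the slots in the chapter's notation; optional slots go in parentheses."""
        return " - ".join(
            f"({name})" if optional else name for name, optional in self.slots
        )


# Pattern C as described in Section 5.3.
PATTERN_C = SemanticPattern(
    label="C",
    trigger_classes=["kinship term"],
    name_type="personal name",
    trigger_position="external",
    slots=[
        ("referent specification phrase", True),
        ("TRIGGER", False),
        ("possessor phrase", True),
        ("location phrase", True),
    ],
)
```

Calling `PATTERN_C.render()` reproduces the semantic pattern string given for Pattern C.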


5.1 Pattern A 


(a) Semantic class of the trigger: legislative job title, executive job title, judicial
position, academic position, academic title, military rank, profession. Specification
of military ranks and top-level legislative, executive, and judicial triggers is
not allowed: *prime minister of finance; lower level legislative triggers can be
specified: engineer in automatics. (b) Type of the proper name: personal name
extended or substituted by a family name. (c) External trigger. (d) Semantic pattern:
(referent specification phrase) - TRIGGER - (domain specification phrase) -
(possessor phrase) - (affiliation phrase) - (location phrase).


Example: English: (his | Stefan's) (new) PROFESSOR (of law) (at the University) 
(in Plovdiv) Ivan Ivanov. 


5.2 Pattern B 


(a) Semantic class of the trigger: aristocratic title, religious title. (b) Type of the
proper name: personal name or family name. Some aristocratic and religious
titles are selected only by a personal name (POPE Francis), while others are selected
by a family name (LORD Orsini). A trigger can also be part of a personal
name (for distinguished persons) but no separate pattern is defined for this type
of name. (c) External trigger. (d) Semantic pattern: (referent specification phrase)
- TRIGGER - (affiliation phrase) - (location phrase).


Example: English: the (new) METROPOLITAN (of the Church) (in San Francisco) 
Iona. 


5.3 Pattern C 


(a) Semantic class of the trigger: kinship term. (b) Type of the proper name: per- 
sonal name (rarely modified by family name(s)). (c) External trigger. (d) Semantic 
pattern: (referent specification phrase) - TRIGGER - (possessor phrase) - (location
phrase). 


Example: English: (his | Ivan's) (blond) STEP-BROTHER (from Sofia) Stefan. 


5.4 Pattern D


(a) Semantic class of the trigger: holy title (a limited set of words). (b) Type of the 
proper name: personal name (rarely modified or substituted by a nickname). (c) 
External trigger. (d) Semantic pattern: (referent specification phrase) - TRIGGER 
- (location phrase). 


Example: English: (miraculous) SAINT (from Patara) Nicholas. 


5.5 Pattern E 


(a) Semantic class of the trigger: true honorific. (b) Type of the proper name: 
personal name extended or substituted by a family name. (c) External trigger. (d) 
Semantic pattern: TRIGGER.


Example: English: MONSIEUR Ivan Ivanov. 


5.6 Pattern F 


(a) Semantic class of the trigger: location. (b) Type of the proper name: location 
name. (c) External trigger. (d) Semantic pattern: (referent specification phrase) -
TRIGGER - (specification phrase) - (possessor phrase) - (location phrase).


Example: English: the (beautiful) CITY (near the big river), Plovdiv.


5.7 Pattern G 


(a) Semantic class of the trigger: location. (b) Type of the proper name: location 
name. (c) Internal trigger. (d) Semantic pattern: (referent specification phrase) -
internal TRIGGER - (location phrase).


Example: English: the (beautiful) MOUNT Fuji (in Japan). 


5.8 Pattern H 


(a) Semantic class of the trigger: organization. (b) Type of the proper name: orga- 
nization name. (c) External trigger. (d) Semantic pattern: (referent specification 
phrase) - TRIGGER - (domain specification phrase) - (possessor phrase) - (affili- 
ation phrase) - (location phrase). 


Example: English: the (new) COMPANY (of his friends) (in Athens), Tetracom.


5.9 Pattern I


(a) Semantic class of the trigger: organization. (b) Type of the proper name: orga- 
nization name. (c) Internal trigger. (d) Semantic pattern: (referent specification 
phrase) - internal TRIGGER - (location phrase). 


Example: English: the (new) Hebros BANK (in Athens). 


6 Language-specific syntactic patterns for persons, 
locations and organizations 


We define the semantic patterns evoked by different types of proper names when 
combined with triggers, and the syntactic patterns that involve combinations of: 
modifiers, one or several, semantically restricted by the head proper noun and 
the trigger, and complements, semantically restricted by the trigger. 

The syntactic patterns are language-specific and differ for personal, location 
and organization names. The syntactic patterns may involve combinations of 
adjectival modifiers in pre- or post-nominal position, one or several; pronoun 
modifiers in pre-nominal position (possessive and demonstrative); complements 
in post-nominal position, one or several; and a noun modifier in pre- or post-
nominal position, alternating with a prepositional phrase.² Adjectival modifiers
(that in Bulgarian, French, Greek and Serbian agree with the head noun in gender 
and number) may indicate physical shape, status, etc. Complements may indicate 
(domain) specification, affiliation, location, possessor, and may be prepositional 
or case complements, depending on the language structure. 

Multiword names have the structure of a noun phrase and exhibit specific 
properties with respect to constituency of the head noun and the components, 
including various constraints on modifiers, complements, clitics (in Bulgarian 
and Greek), etc. 

The syntactic patterns represent language-specific grammatical features and 
dependencies and how these features and dependencies are manifested in a par- 
ticular language. One or more syntactic patterns from one or different languages 
may correspond to the same semantic pattern. The syntactic patterns, as they 
are presented in this paper, define constituency and reflect the morphological 
and syntactic structure of a particular language, although they do not strictly describe
phrase structure and grammatical dependencies. However, the syntactic
patterns are formal enough to code the linguistic information correctly and to
allow for the conversion to some formalism.

²The alternation between a noun modifier and a prepositional phrase is not described in the syntactic patterns.

Syntactic patterns corresponding to the largely universal semantic patterns, 
described in §5, are formulated for English, Bulgarian, French, Greek, and Serbian.
The generalizations for semantic patterns and respective syntactic patterns were 
constructed on the basis of observations and classifications made on dictionaries 
of NEs, annotated corpora of NEs and grammars for NE recognition developed 
so far (Krstev et al. 2013; Koeva & Dimitrova 2015). 


6.1 Syntactic pattern A (single family name or multiword personal 
name) 


Characteristics shared by the five languages³: i) Triggers are placed to the left of
the personal name; a complex trigger phrase is likely to be an apposition⁴. Examples:
English: the new PROFESSOR of Sustainable Agriculture and Climate Change,
Chris Smith or Chris Smith, the new PROFESSOR of Law at the University's Humanities
Institute; French: Le nouveau PROFESSEUR de morale de l'Université de
Fribourg, Thierry Collaud ‘the new PROFESSOR in ethics at the University of Fribourg,
Thierry Collaud’ or Thierry Collaud, le nouveau PROFESSEUR de morale
de l'Université de Fribourg ‘Thierry Collaud, the new PROFESSOR in ethics at the
University of Fribourg’; ii) If no modifiers or complements exist, the trigger is
indefinite (except for Greek where the article is obligatory); otherwise, it is definite.
Examples: Bulgarian: MINISTÁR Kuneva ‘MINISTER Kuneva’ versus noviyat
MINISTÁR v pravitelstvoto Ivan Dimitrov ‘new-the MINISTER in government-the
Ivan Dimitrov’; English: PROFESSOR Chomsky versus the PROFESSOR of linguistics
Noam Chomsky; iii) Personal names can be extended with one or more (rarely
more than two) proper names: a nickname, a patronym and/or a family name,
constituting a MWE. Example: Serbian: Dr Slavica Đukić-Dejanović, PREDSEDNIK
Narodne skupštine ‘Dr. Slavica Đukić-Dejanović, PRESIDENT of National Assembly’.⁵


³Language-specific characteristics were also formulated but due to limitation of space they are
represented only in the syntactic patterns.

⁴The restrictive apposition of a proper noun (whose omission changes the meaning of the sentence)
is covered by the syntactic patterns (the new PROFESSOR of Law, Chris Smith). The non-restrictive
apposition (Chris Smith, the new PROFESSOR of Law) is not included due to limitation
of space. In this paper, the term apposition is used for the “non-restrictive apposition”
only.

⁵The vertical bar “|” separates alternatives. The question mark “?” indicates zero or one occurrences
of the preceding element. The asterisk “*” indicates zero or more occurrences of the preceding
element. Parentheses “()” are used to define the scope and precedence of the operators. The
equality sign “=” indicates the semantics of the prepositions and case phrases.


English: (((DefArt|GenDet|PossPron) Adj* TRIGGER (PP in|of|for = DomSpec)?
(PP at = Aff)* (PP in = Loc)?) | (DefArt TRIGGER (PP in|of|for = DomSpec)?
(PP at|of = Aff)* (PP in = Loc)?) | INDEFTRIGGER) PerN


Bulgarian: ((((DefAdj (PossCL)?) | DefPossPron) Adj* TRIGGER (PP po|na|za = DomSpec)?
(PP na|pri|v|kám|ot = Aff)* (PP ot = Loc)?) | (DEFTRIGGER (PossCL)?
(PP po|na|za = DomSpec)? (PP na|pri|v|kám|ot = Aff) (PP ot = Loc)?) |
INDEFTRIGGER) PerN


French: (((DefArt|PossPron) Adj* TRIGGER (PP en|de = DomSpec)? (PP à = Aff)*
(PP à = Loc)?) | (DefArt TRIGGER Adj* (PP en|de = DomSpec)? (PP à|de =
Aff) (PP à = Loc)?) | INDEFTRIGGER) PerN


Greek: ((DefArt Adj* (PossPron)? TRIGGER (GenP = DomSpec)? (PP σε = Aff)* (PP
σε = Loc)?) | (DefArt TRIGGER (GenP = DomSpec)? (GenP = Aff)* (PP σε =
Loc)?) | DEFTRIGGER) PerN


Serbian: ((Adv | PossPron)? DefAdj* TRIGGER (PP za | GenP = DomSpec)? (PP u|pri
| GenP = Aff)* (PP u|za|iz GenP = Loc)?) PerN


Table 1: Syntactic pattern A - an example translated in the five languages.
The examples do not illustrate all variants.

English    Spain's SECRETARY of State for Foreign Affairs, Gonzalo de Benito Secades
Bulgarian  Dárzavniyat SEKRETAR na vansnite raboti na Ispaniya, Gonzalo de Benito Sekades
French     le MINISTRE espagnol des Affaires Étrangères, Gonzalo de Benito Secades
Greek      ο Ισπανός ΥΠΟΥΡΓΟΣ Εξωτερικών, Gonzalo de Benito Secades
Serbian    Državni SEKRETAR za spoljne poslove Španije, Gonzalo de Benito Sekades


6.2 Syntactic pattern B (single- or multiword personal name) 


Characteristics shared by the five languages: i) The trigger phrase is placed in
front of the personal name but the appositive order can also be found, especially
if the trigger phrase is complex. Examples: English: Venerable Dionysius, the
ARCHIMANDRITE of St Sergius’ Monastery; Serbian: njegovo preosveštenstvo EPISKOP
niški Irinej ‘His Grace BISHOP of-Niš Irinej’; ii) If no trigger modifiers or complements
exist, the trigger is indefinite (except for Greek where the article is
obligatory); otherwise, it is definite. Examples: Bulgarian: PATRIARH Maksim
‘PATRIARCH Maxim’; French: L’ARCHEVÊQUE de Paris, Monseigneur André Vingt-
Trois ‘the ARCHBISHOP of Paris, Monsignor André Vingt-Trois’; Le bienheureux
PÈRE Brottier ‘the blessed FATHER Brottier’; iii) Personal names can be extended
with one or more (rarely more than two) proper names: a nickname, a patronym
and/or a family name. These constituents form a complex name (MWE). Example:
Greek: ο πρόσφατα χειροτονηθείς Σεβασμιότατος ΜΗΤΡΟΠΟΛΙΤΗΣ Κεφαλληνίας,
ΠΑΤΕΡΑΣ Γεώργιος Σαπουνάς ‘the newly appointed Most Reverend BISHOP
METROPOLITAN of Kefalonia, FATHER Georgios Sapounas’.


English: ((DefArt Adj* TRIGGER (PP of = Aff)* (PP in = Loc)?) | (DefArt TRIGGER
(PP at|of = Aff)* (PP in = Loc)?) | INDEFTRIGGER) PerN


Bulgarian: (((DefAdj (PossCL)? | DefPossPron) Adj* TRIGGER (PP na|pri|v|kám|ot
= Aff)? (PP ot = Loc)?) | (DEFTRIGGER (PP na|pri|v|kám|ot = Aff)? (PP ot =
Loc)?) | INDEFTRIGGER) PerN


French: ((DefArt Adj* TRIGGER (PP de = Aff) (PP à|en = Loc)?) | (DefArt
TRIGGER* Adj* (PP de = Aff)* (PP à|en = Loc)?) | INDEFTRIGGER) PerN


Greek: ((DefArt Adj* TRIGGER (GenP = Aff)* (GenP = Loc)?) | (DefArt TRIGGER
(GenP = Aff | Loc)* (PP σε = Loc)?) | DEFTRIGGER) PerN


Serbian: ((((PossPron | PossAdj) TRIGGER) | (DefAdj* TRIGGER) | (TRIGGER DefAdj?))
(PP u|za | GenP = Aff)? (PP u|za|iz GenP = Loc)?) PerN DefAdj*


Table 2: Syntactic pattern B - an example translated in the five languages.

English    the new BISHOP of the Christian Catholic Church of Switzerland, DR Harald Rein
Bulgarian  noviyat EPISKOP na Hristiyanskata katoliceska cárkva v Sveycariya, D-R Harald Rein
French     le nouvel ÉVÊQUE de l'Église catholique-chrétienne de la Suisse, DR Harald Rein
Greek      ο νέος ΑΡΧΙΕΠΙΣΚΟΠΟΣ της Χριστιανικής Καθολικής Εκκλησίας της Ελβετίας, ΔΡ. Harald Rein
Serbian    novoimenovani BISKUP Starokatoličke crkve Švajcarske, DR Harald Rajn




6.3 Syntactic pattern C (single- or, rarely, multiword personal name) 


Characteristics shared by the five languages: i) The usual position of the trigger is
in front of the personal name. The reverse order indicates apposition. Examples:
Bulgarian: moyata hubava SESTRA Ana ‘my-the beautiful SISTER Anna’; Greek:
η ΑΔΕΡΦΗ του Πέτρου, η Μαρία ‘the SISTER of Peter, the Maria’; η Μαρία, η
ΑΔΕΡΦΗ του Πέτρου ‘the Maria, the SISTER of Peter’. ii) The trigger is accompanied
by modifiers or complements. Examples: English: his older STEP-BROTHER
from Paris, Stefan; Serbian: njegov rođeni BRAT četvorogodišnji Zoran ‘his birth
BROTHER 4-year-old Zoran’; iii) The phrase headed by the trigger is definite. Examples:
French: la SŒUR de Marc, Marie ‘the SISTER of Marc, Maria’ vs. sa SŒUR,
Marie ‘his SISTER, Maria’; son BEAU-FRÈRE de Paris, Jean ‘his BROTHER-IN-LAW
from Paris, John’.


English: (((DefArt | GenDet | PossPron) Adj* TRIGGER (PP from = Loc)?) | (DefArt
Adj* TRIGGER (PP of = Poss⁶)? (PP from = Loc)?)) PerN


Bulgarian: ((((DefAdj (PossCL)?) | DefPossPron) Adj* TRIGGER (PP ot = Loc)?) |
((DEFTRIGGER (PossCL)?) (PP ot = Loc)?) | (DefAdj Adj* TRIGGER (PP na =
Poss)?) | (DEFTRIGGER (PP na = Poss)?)) PerN


French: (DefArt | PossPron) Adj* TRIGGER ((PP de = PerN) | (PP de = Loc)?) PerN


Greek: (DefArt Adj* TRIGGER ((GenP = PerN) | (GenP = Poss))? (PP από DefArt
= Loc)?) DefArt PerN


Serbian: (PerN* TRIGGER* (DefAdj* | (GenP = PerN))? (PP iz = Loc)?) PerN |
(PossPron? DefAdj* TRIGGER* (PP iz = Loc)?) PerN


Table 3: Syntactic pattern C - an example translated in the five languages.

English    his beautiful SISTER-IN-LAW from Athens, Maria
Bulgarian  negovata krasiva SNAHA ot Atina, Maria
French     sa jolie BELLE-SOEUR d’Athènes, Maria
Greek      η όμορφη ΚΟΥΝΙΑΔΑ του από την Αθήνα, η Μαρία
Serbian    njegova lepa SNAJA iz Atine, Marija


⁶The possessor phrase is shown only for the kinship terms.


6.4 Syntactic pattern D (single- and, rarely, multiword personal
name)


Characteristics shared by the five languages: i) The trigger appears before the
personal name but a complex trigger phrase often occurs in apposition. Examples:
English: SAINT Haralambos, the Holy MARTYR of Magnesia; Serbian: SVETI
mučenik i arhiđakon Lavrentije ‘SAINT martyr and archdeacon Lavrentije’. ii)
If no modifiers or complements exist, the trigger is indefinite (except for Greek
where the article is obligatory); otherwise, it is definite. Examples: Bulgarian:
SVETI Nikola ‘SAINT Nicholas’; French: le SAINT de l'Arcadie: Charles De Menou
D'Aulnay ‘the SAINT of Arcadia: Charles De Menou D’Aulnay’.


English: (DefArt Adj* TRIGGER (PP of|from = Loc)? PerN) | (DefArt Adj* TRIGGER
PerN (PP of|from = Loc)?) | (TRIGGER PerN (PP of|from = Loc)?)


Bulgarian: (((DefAdj (PossCL)? | DefPossPron) Adj* TRIGGER (PP na|ot = Loc)?) |
(DefAdj Adj* TRIGGER (PP na|ot = Loc)?) | (DEFTRIGGER (PP na|ot = Loc)?)
| INDEFTRIGGER) PerN


French: (DefArt Adj* TRIGGER (PP de = Loc)? PerN) | (TRIGGER Adj* PerN (PP de
= Loc)?)


Greek: (DefArt Adj* TRIGGER (GenP = Loc)? PerN) | (DefArt TRIGGER PerN (GenP
= Loc)?)


Serbian: (PossPron? DefAdj* TRIGGER ((PP iz|u|sa = Loc)? DefAdj? (N)?) PerN
PossAdj? | PerN PossAdj? (PP iz|u|sa = Loc)?)


Table 4: Syntactic pattern D - an example translated in the five languages.

English    the Holy MARTYR Chrysostomos of Smyrna
Bulgarian  Svetiyat MÁCENIK Hrisostom ot Smirna
French     le Saint MARTYR Chryssostomos de Smyrne
Greek      ο Άγιος ΙΕΡΟΜΑΡΤΥΡΑΣ Χρυσόστομος Σμύρνης
Serbian    Sveti MUČENIK Hrizostom Smirnski


6.5 Syntactic pattern E (family name or multiword personal name)


Characteristics shared by the five languages: i) The external trigger usually appears
before the personal name, except for some triggers that can also appear after
the personal name. Examples: English: MR. Smith; Serbian: Dusan Raskovic,
DIPL. OEC.; ii) No modifiers and/or complements are allowed. Example: French:
M. Dupont ‘MR. Dupont’; iii) If the trigger is indefinite, an article is not permissible
(except for Greek where the article is obligatory). Example: English: *(the)
DR. Livingstone; iv) Personal names can be extended with one or more (rarely
more than two) proper names: a nickname, a patronym and/or a family name
that form a complex name (MWE). Example: Bulgarian: GOSPODIN Ivan Ivanov
‘MISTER Ivan Ivanov’.


English, Bulgarian, French: INDEFTRIGGER PerN


Greek: DefArt TRIGGER PerN


Serbian: (TRIGGER PerN) | (PerN TRIGGER)
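
Because the pattern notation is regular-expression-like, a pattern can be checked mechanically against a sequence of category labels. The following Python sketch is an illustration only, not the grammar implementation used in the study: the function names `compile_pattern` and `matches` are hypothetical, and the sketch handles only patterns built from category symbols, parentheses, `|`, `?` and `*` (it does not interpret the `PP ... = ...` annotations).

```python
import re

# A token is either a category symbol (letters only) or one of the operators.
TOKEN = re.compile(r"[A-Za-z]+|[()|?*]")


def compile_pattern(pattern):
    """Translate a syntactic pattern into a regex over space-terminated labels."""
    parts = []
    for token in TOKEN.findall(pattern):
        if token == "(":
            parts.append("(?:")                          # non-capturing group
        elif token in ")|?*":
            parts.append(token)                          # same meaning in regex
        else:
            parts.append("(?:" + re.escape(token) + " )")  # one whole label
    return re.compile("".join(parts))


def matches(pattern, labels):
    """True if the whole label sequence is licensed by the pattern."""
    text = "".join(label + " " for label in labels)
    return compile_pattern(pattern).fullmatch(text) is not None


# Pattern E as given above for Serbian and Greek.
SERBIAN_E = "(TRIGGER PerN) | (PerN TRIGGER)"
GREEK_E = "DefArt TRIGGER PerN"
```

Under these assumptions, `matches(SERBIAN_E, ["PerN", "TRIGGER"])` holds, while `matches(GREEK_E, ["TRIGGER", "PerN"])` fails because the obligatory Greek article is missing.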


Table 5: Syntactic pattern E - an example translated in the five languages.

English    DR. Mary Andrew Smith
Bulgarian  D-R Meri Andryu Smit
French     DR Mary Andrew Smith
Greek      η ΔΡ. Μαίρη Άντριου Σμιθ
Serbian    DR Meri Endru Smit


6.6 Syntactic pattern F (single- or multiword location name) 


Characteristics shared by the five languages: i) The trigger appears before the
location name. A heavy trigger phrase is often found in apposition. Examples:
Greek: η όμορφη ΠΟΛΗ των Παρισίων / του Παρισιού ‘the beautiful CITY of Paris’;
η Σαντορίνη, το πιο όμορφο ΝΗΣΙ της Ελλάδας ‘the Santorini, the most beautiful
ISLAND in Greece’. ii) The phrase headed by the trigger is definite. Examples: English:
the beautiful CITY of artists, Plovdiv; our beautiful CITY, Plovdiv; French: la
belle VILLE des mille fontaines, Aix-en-Provence ‘the beautiful CITY of a thousand
fountains, Aix-en-Provence’; iii) Location names can be MWEs. Examples: Bulgarian:
HRAM-PAMETNIK Sveti Aleksandar Nevski ‘CATHEDRAL Saint Alexander
Nevski’; Serbian: NADOŠLI Beli Timok ‘RISING Beli Timok’.


English: (((DefArt | GenDet | PossPron) Adj* TRIGGER (PP of = Spec)? (PP in =
Loc)?) | (DefArt TRIGGER (PP of = Spec)?) | INDEFTRIGGER) LocN


Bulgarian: (((DefAdj (PossCL)? | DefPossPron) Adj* TRIGGER (PP na = Spec)? (PP
v|na|pri = Loc)?) | (DEFTRIGGER (PP na = Spec)? (PP v|na|pri = Loc)?) |
INDEFTRIGGER) LocN


French: (((DefArt | PossPron) Adj* TRIGGER (PP de = Spec)? (PP de = Loc)?) |
(DefArt TRIGGER Adj* (PP de = Spec)?)) LocN


Greek: (((DefArt | PossPron) Adj* TRIGGER (GenP = Spec)? (PP σε = Loc)?) |
(DefArt TRIGGER (GenP)?)) LocN


Serbian: ((PossPron | PossAdj)? DefAdj* TRIGGER (PP u|na|pri|pod|u blizini | GenP
= Loc)?) LocN


Table 6: Syntactic pattern F - an example translated in the five languages.

English    the most romantic CITY in the world, Paris
Bulgarian  nay-romantičniyat GRAD v sveta, Pariž
French     la plus romantique VILLE dans le monde, Paris
Greek      η πιο ρομαντική ΠΟΛΗ του κόσμου, το Παρίσι
Serbian    najromantičniji GRAD na svetu, Pariz


6.7 Syntactic pattern G (multiword location name) 


Characteristics shared by the five languages: i) The internal trigger is part of
the location name, thus the location name is always a MWE. Examples: Bulgarian:
nasiyat hubav GRAD Novi han ‘our-the beautiful CITY Novi han’; Greek:
ο Ινδικός ΩΚΕΑΝΟΣ ‘the Indian OCEAN’; ii) A location name with an internal
trigger is fixed: the order of constituents cannot be changed and insertions are
not allowed. Examples: Bulgarian: nasata Stara PLANINA ‘our-the Stara PLANINA’,
*PLANINA stara; French: le célèbre MONT Blanc ‘the famous MONT Blanc’,
*Blanc Mont; iii) The location name may contain a personal or a location name
(rarely an organization name). Example: English: Minnesota RIVER; iv) The internal
trigger may be specified by the same range of modifiers and complements
permissible for the trigger.⁷ Example: Bulgarian: Černi VRAH ‘Black PEAK’; v)
An external trigger can be added if different from the internal one. Example: English:
CITY of Colorado SPRINGS; vi) Heavy trigger phrases often occur as appositions.
Example: Serbian: Novo Brdo, najveći rudarski GRAD u Srbiji i na celom
Balkanskom poluostrvu ‘New Hill, biggest mining TOWN in Serbia and on the
entire Balkan Peninsula’.


English: ((DefArt | GenDet | PossPron) Adj* MWLocN (PP in = Loc)?) | (DefArt
MWLocN (PP in = Loc)?) | MWLocN


Bulgarian: (((DefAdj (PossCL)?) | DefPossPron) Adj* MWLocN (PP v|na|pri =
Loc)?) | (DefMWLocN (PP v|na|pri = Loc)?) | MWLocN


French: ((DefArt | PossPron) Adj* MWLocN (PP de = Loc)?) | (MWLocN (PP de
= Loc)?) | MWLocN


Greek: ((DefArt | PossPron) Adj* MWLocN (GenP = Loc)?) | (DefMWLocN (GenP
= Loc)?) | MWLocN


Serbian: ((PossPron | PossAdj)? DefAdj* MWLocN (PP u|na|pri|pod|u blizini | GenP
= Loc)?) | MWLocN


Table 7: Syntactic pattern G - an example translated in all five languages.

English    the vast Great PLAINS in the United States
Bulgarian  neobyatnite Golemi RAVNINI v Sáedinenite Stati
French     les Grandes PLAINES aux États-Unis
Greek      τα Great PLAINS της Αμερικής
Serbian    prostrane Velike RAVNICE u Sjedinjenim Državama


6.8 Syntactic pattern H (single- and multiword organization name) 


Characteristics shared by the five languages: i) The trigger may either be placed
before or after the organization name in case of apposition. Examples: Bulgarian:
hranitelnata KOMPANIYA “Danon” ‘nutritional-the COMPANY Danone’; English:
Apache CORP. ii) The phrase headed by the trigger is definite. If the trigger is a
single-word one or specified for domain, the trigger phrase may be indefinite (except
for Greek where the article is obligatory). Examples: English: the COMPANY of
Buffett, Berkshire Hathaway; Bulgarian: investicionen FOND “Razvitie” ‘investment
FUND Razvitie’; French: la nouvelle COMPAGNIE Santus ‘the new COMPANY
Santus’, notre nouvelle COMPAGNIE Santus ‘our new COMPANY Santus’, la nouvelle
COMPAGNIE de Pierre, Santus ‘the new COMPANY of Pierre, Santus’; iii) The organization
name can be a MWE - either a complex personal or location name or
a fixed multiword organization name. Examples: Bulgarian: novoto UČILIŠTE za
deca s uvreden sluh v Sofiya “Prof. Dr. Deco Denev” ‘new-the SCHOOL for children
with impaired hearing in Sofia Prof. Dr. Deco Denev’; Greek: η μεταλλευτική
ΕΤΑΙΡΙΑ Ελληνικός Χρυσός ‘the mining COMPANY Hellas Gold’, η Ελληνικός
Χρυσός, μεταλλευτική ΕΤΑΙΡΙΑ ‘the Hellas Gold mining COMPANY’; Serbian: FABRIKA
mašina “Ivo Lola Ribar” ‘Machine FACTORY “Ivo Lola Ribar”’.


⁷For simplicity, not all variants are presented in the syntactic pattern (applicable also to Syntactic
pattern I in §6.9).


English: (((DefArt | GenDet | PossPron) Adj* TRIGGER (PP in|at = DomSpec)? (PP
in|at = Loc)?) | (DefArt Adj* TRIGGER (PP in|at = DomSpec)? (PP in|at = Loc)?)
| TRIGGER) OrgN


Bulgarian: ((((DefAdj (PossCL)?) | DefPossPron) Adj* TRIGGER (PP po|na|za = DomSpec)?
(PP v|na|pri = Loc)?) | ((DEFTRIGGER (PossCL)?) (PP po|na|za
= DomSpec)? (PP v|na|pri = Loc)?) | TRIGGER) OrgN


French: (((DefArt | PossPron) Adj* TRIGGER (PP de = DomSpec)?) | (DefArt Adj* 
TRIGGER (PP de = PersN)?) | (DefArt TRIGGER)) (PP oe DefArt = OrgN) 


Greek: ((DefArt Adj* TRIGGER (PossPron)? ((PP vo = DomSpec)? | (Gen = DomSpec)?))
| (DefArt Adj* TRIGGER (GenDet)?) | (DefArt TRIGGER)) (PP σε DefArt
= OrgN)


Serbian: (PossPron? DefAdj* TRIGGER (PP za | GenP = DomSpec)? (PP u|pri| GenP 
= Aff)? (PP iz|u| GenP = Loc)? OrgN) | (OrgN TRIGGER) 


Table 8: Syntactic pattern H - an example translated in the five languages.

English    China's investment BANK, China International Capital Corporation Limited
Bulgarian  kitayskata investicionna BANKA China International Capital Corporation Limited
French     la BANQUE d'Investissements en Chine, China International Capital Corporation Limited
Greek      η ΤΡΑΠΕΖΑ Επενδύσεων της Κίνας, η China International Capital Corporation Limited
Serbian    kineska investiciona BANKA, Kineska međunarodna kapitalna korporacija


6.9 Syntactic pattern I (multiword organization name) 


Characteristics shared by the five languages: i) The trigger is an integral part of
the organization name, thus the organization name is always a MWE. Examples:
English: the European BANK for Reconstruction and Development in Serbia;
the ASSOCIATION of Chartered Certified Accountants; Greek: η ΤΡΑΠΕΖΑ
Εμπορίου και Ανάπτυξης της Μαύρης Θάλασσας ‘the Black Sea Trade and
Development BANK’. ii) Organization names containing an integral trigger are
fixed: the order of constituents cannot be changed and insertions are not allowed.
Examples: Bulgarian: Evropeyska BANKA za vázstanovyavane i razvitie
‘European BANK for Reconstruction and Development’; novosázdadeniyat
Evropeyski FOND za strategiceski investicii ‘newly-founded-the European FUND
for Strategic Investment’; iii) Organization names with an integral trigger can
contain a personal, location or organization name. Example: Serbian: Memorijalni
CENTAR “Josip Broz Tito” ‘Memorial CENTER “Josip Broz Tito”’; iv) The
internal organization trigger can be specified by the same range of modifiers and
complements permissible for it in regular use. Example: French: L’ASSOCIATION
des Historiens ‘the ASSOCIATION of Historians’; v) Rarely, an organization trigger,
different from the integral trigger, can specify the multiword organization
name. Examples: Bulgarian: SÁYUZ na tárgovcite v Bálgariya ‘UNION of traders-the
in Bulgaria’; ASOCIACIYA “SÁYUZ na tárgovcite v Bálgariya” ‘ASSOCIATION
UNION of traders-the in Bulgaria’.


English: ((DefArt | PossPron) Adj* MWOrgN (PP at|in = Loc)?) | (DefMWOrgN
(PP at|in = Loc)?) | MWOrgN


Bulgarian: (((DefAdj (PossCL)?) | DefPossPron) Adj? MWOrgN (PP v|na|pri =
Loc)?) | (DefMWOrgN (PP v|na|pri = Loc)?) | MWOrgN


French: (DefArt | PossPron) Adj* MWOrgN (PP à|de = Loc)?


Greek: ((DefArt | PossPron) Adj* MWOrgN (PP σε = Loc)?) | (DefArt MWOrgN
(GenP = Loc)?) | MWOrgN


Serbian: (PossPron? DefAdj* MWOrgN (GenP = Poss)? (PP iz|u| GenP = Loc)?) | 
MWOrgN 


Table 9: Syntactic pattern I - an example translated in all five languages.

English    World ASSOCIATION for Small and Medium Enterprises
Bulgarian  Svetovna ASOCIACIYA za malki i sredni predpriyatiya
French     ASSOCIATION Mondiale des Petites et Moyennes Entreprises
Greek      Διεθνής ΣΥΝΔΕΣΜΟΣ Μικρομεσαίων Επιχειρήσεων
Serbian    Svetsko UDRUŽENJE malih i srednjih preduzeća


7 Comparison of the five languages 


At the semantic level, languages do not differ (significantly). The semantic pat- 
terns of proper names define the common semantics, regardless of the language 
in which it is realized: semantic patterns are language-neutral. Languages differ 
in lexical and phrasal categories, constituency, word order permutations and al- 
terations. The differences in word order and alterations insert some nuances in 
the expressed meaning, i.e., the viewpoint of the speaker, but they do not alter 
the general meaning. The syntactic patterns of proper names show the correspon- 
dences among languages at the syntactic level: syntactic patterns are language- 
specific. A semantic pattern is a representation that can be linked with different 
syntactic frames in different languages and, vice versa, syntactic patterns from 
different languages may share a single semantic pattern. Thus, syntactic patterns 
make explicit the similarities and differences in the grammatical structure of the 
five languages. 

The structure of the language-neutral semantic and language-specific syntactic
patterns can be represented as a graph whose nodes are semantic and syntactic 
patterns while the arcs represent different languages. More than one language- 
specific syntactic pattern may be linked to one language-neutral semantic frame; 
in such a case, syntactic patterns are synonymous to the extent that they rep- 
resent a common semantic structure. Through this type of representation, we 
offer an interlingual mapping of the syntactic structures of named entities in the 
five languages. Some of the most distinctive grammatical characteristics of NEs 
in English, Bulgarian, French, Greek and Serbian with respect to the single and 
multiword morphology and syntax will be outlined below. 
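
The graph described above can be sketched as a small labelled mapping. The Python fragment below is a hypothetical illustration, not the authors' data structure: the names `ARCS`, `build_graph` and `correspondences` are assumptions, and the arcs are limited to the syntactic patterns given for Pattern E in Section 6.5.

```python
from collections import defaultdict

# (semantic pattern, language on the arc, syntactic pattern), from Section 6.5.
ARCS = [
    ("E", "English", "INDEFTRIGGER PerN"),
    ("E", "Bulgarian", "INDEFTRIGGER PerN"),
    ("E", "French", "INDEFTRIGGER PerN"),
    ("E", "Greek", "DefArt TRIGGER PerN"),
    ("E", "Serbian", "(TRIGGER PerN) | (PerN TRIGGER)"),
]


def build_graph(arcs):
    """Index syntactic patterns by semantic pattern and by language."""
    graph = defaultdict(lambda: defaultdict(list))
    for semantic, language, syntactic in arcs:
        graph[semantic][language].append(syntactic)
    return graph


def correspondences(graph, semantic, source, target):
    """Pair each source-language pattern with the target-language patterns
    that share the same language-neutral semantic pattern."""
    return [
        (src, tgt)
        for src in graph[semantic][source]
        for tgt in graph[semantic][target]
    ]


GRAPH = build_graph(ARCS)
```

Looking up the English and Greek realizations of Pattern E then pairs `INDEFTRIGGER PerN` with `DefArt TRIGGER PerN`, which makes the obligatory Greek article visible as a cross-lingual difference.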


7.1 Grammatical categories of dependent constituents 


The syntactic patterns of English proper name triggers involve combinations of 
adjectival modifiers in pre-nominal position, one or several (they can be preceded 
by a definite article) (the great POET Burns); possessive pronoun modifiers in
pre-nominal position (Welcome our new PROFESSOR, Jennifer S. Locke!); prepositional
complements in post-nominal position; a noun modifier in pre-nominal
position, alternating with a genitive determiner and a prepositional phrase, e.g., 
the Grieg Piano CONCERTO vs. Grieg's Piano CONCERTO vs. the Piano CON- 
CERTO by Grieg. 

In Bulgarian, the syntactic patterns for proper name triggers exhibit the following
combinations: adjectival modifiers in pre-nominal position (noviyat PREDSEDATEL
Petrov ‘new-the CHAIR Petrov’); possessive pronoun modifiers in pre-nominal
position alternating with a possessive pronoun clitic in post-nominal
position (moyata SESTRA Ana ‘my-the SISTER Ana’ vs. SESTRA mi Ana ‘SISTER
my.PossCL Ana’); prepositional complements in post-nominal position (KOMPANIYATA
na Ivan “Elit” ‘COMPANY-THE of Ivan “Elit”’); and a noun modifier in
pre-nominal position, alternating with a PP (karate INSTRUKTORÁT Ivan ‘karate
TRAINER-THE Ivan’ vs. INSTRUKTORÁT po karate Ivan ‘TRAINER-THE in karate
Ivan’).

In French, the phrase headed by a trigger is definite (with an alternation of
the phrase with a definite article or possessive pronoun: la belle VILLE des Mille
Fontaines, Aix-en-Provence ‘the beautiful CITY of a thousand fountains, Aix-en-
Provence’; notre belle VILLE, Paris ‘our beautiful CITY, Paris’). The proper name
can be introduced by a preposition: la belle VILLE de Paris ‘the beautiful CITY of
Paris’; notre belle VILLE de Paris ‘our beautiful CITY of Paris’.

In Greek, simple and multiword proper names are preceded by a definite article,
e.g., το Παρίσι ‘the Paris’, οι Ηνωμένες Πολιτείες Αμερικής ‘the United
States of America’. The phrase headed by the trigger is also definite: η όμορφη
ΠΟΛΗ των καταρρακτών, η Έδεσσα ‘the beautiful CITY of waterfalls, the Edessa’;
η όμορφή μας ΠΟΛΗ, η Έδεσσα ‘the beautiful our.PossCL CITY, the Edessa’. A location
name can be put in the genitive case: η όμορφη ΠΟΛΗ του Παρισιού ‘the
beautiful CITY of Paris’. In that case, the use of the possessive pronoun clitic is
not possible: *η όμορφή μας ΠΟΛΗ του Παρισιού ‘the beautiful our.PossCL CITY
of Paris’.

The syntactic patterns for Serbian names comprise combinations of adjectival
modifiers in pre-nominal position (Američka filmska AKADEMIJA ‘American
Film ACADEMY’), which can sometimes be found in post-nominal position (EPISKOP
niški Irinej ‘BISHOP of-Niš Irinej’); prepositional complements in post-nominal
position (AZOTARA u Pančevu ‘Fertilizer PLANT in Pančevo’); and complements in the
genitive (alternating with a prepositional phrase), as in Dom zdravlja ‘House
(of)-Health.Gen (Community Health Center)’ and DIREKCIJA za upravljanje oduzetom
imovinom ‘DIRECTORATE for Management of Seized Assets’.

Coordinated phrases are possible in all languages, e.g., Bulgarian: VICEPREMIERĂT
i MINISTĂR na obrazovanieto i naukata, Meglena Kuneva ‘Deputy PRIME
MINISTER-THE and MINISTER of Education-the and Science-the, Meglena Kuneva’;
French: Martin Vetterli, PROFESSEUR à l'École polytechnique fédérale de
Lausanne et PRÉSIDENT du Conseil national de la recherche ‘Martin Vetterli, PROFESSOR
at the Federal Polytechnic School of Lausanne and PRESIDENT of the National
Research Council’; Greek: ο ΠΡΩΘΥΠΟΥΡΓΟΣ και ΠΡΟΕΔΡΟΣ του ΣΥΡΙΖΑ
Αλέξης Τσίπρας ‘the PRIME MINISTER and HEAD of Syriza Alexis Tsipras’; Serbian:
Hleb i kifle ‘Bread and Rolls’ (an organization name). Both triggers and
proper names can appear in a coordinated construction (we do not encode coordination
in the syntactic patterns).


7.2 Definiteness 


Definiteness is expressed either by a morpheme as in Bulgarian and Serbian, or by
an article as in English, French, and Greek. In English, French, and Greek, the definite
article precedes the trigger, e.g., le PREMIER MINISTRE, Justin Trudeau ‘the
PRIME MINISTER, Justin Trudeau’. There are other means to express definiteness,
i.e., the demonstrative pronouns and the possessive pronouns in English, French, and
Serbian, e.g., French: notre belle VILLE, Paris ‘our beautiful CITY, Paris’; Serbian:
njeno rodno MESTO Beograd ‘her native CITY Belgrade’.

With personal names in English, the definite article is obligatory when the trigger is
modified by an adjective and/or a PP complement: the great POET Burns, the
Scottish POET Burns; the POET from Kosovo, Fahredin Shehu; the AUTHOR of the
Concerto, Edvard Grieg. The possessive pronoun, the article and the genitive
determiner are in complementary distribution in English.

The definite form in Bulgarian is required when the trigger is modified by
an adjective or a possessive pronoun and/or a prepositional phrase. In the former
case, the definite adjective is part of the first phrasal constituent (noviyat
MINISTĂR Valentin Dimitrov ‘new-the MINISTER Valentin Dimitrov’); if there are no
pre-nominal modifiers, the article is on the trigger word (MINISTĂRĂT na finansite
Valentin Dimitrov ‘MINISTER-THE of finance-the Valentin Dimitrov’).

In Greek, all proper names are preceded by a definite article, e.g., η Μαρία ‘the
Maria’, η Αθήνα ‘the Athens’, ο ΠΡΩΘΥΠΟΥΡΓΟΣ Αλέξης Τσίπρας ‘the PRIME
MINISTER Alexis Tsipras’.

In Serbian, the definite article is not used; furthermore, neither possessive pronouns
nor adjectives are obligatory. However, when adjectives precede proper
names, they are in definite form, e.g., od izvesnog Stevice Miletića vs. *od izvesna
Stevice Miletića ‘from certain Stevica Miletić’.


7.3 Distribution of clitics in Bulgarian and Greek 


The possessive pronoun clitics in Bulgarian are right-adjacent to the definite article,
i.e., in the second position in the noun phrase (noviyat ni MINISTĂR Valentin
Dimitrov ‘new-the our.PossCL MINISTER Valentin Dimitrov’ vs. MINISTĂRĂT ni
na finansite Valentin Dimitrov ‘MINISTER-THE our.PossCL of finance-the Valentin
Dimitrov’). The possessive pronoun clitic in Bulgarian is also right-adjacent
to an indefinite kinship term if used without an adjectival modifier, as in MAYKA
mi Maria ‘MOTHER my.PossCL Maria’.

In Bulgarian, the interrogative particle li (which is always a clitic) may appear
after the first definite modifier (if not followed by a possessive pronoun clitic) or
after the whole NP, as in noviyat li DIREKTOR Ivanov ‘new-the li.QuCL DIRECTOR
Ivanov’ and noviyat DIREKTOR Ivanov li ‘new-the DIRECTOR Ivanov li.QuCL’.
The above-stated rules for the definite article, possessive pronoun clitic and interrogative
particle hold for the multiword names too, with the leftmost adjective
being part of the proper name itself (Bălgarskata narodna BANKA ‘Bulgarian-the
National BANK’).

Greek pronoun clitics are right-adjacent to the proper name, e.g., η Μαρία
μου ‘the Maria my.PossCL’. Once there is a trigger followed by the proper name,
the possessive pronoun clitic is between the trigger and the proper name, e.g., ο
ΚΑΘΗΓΗΤΗΣ μας Χρήστος Τσολάκης ‘the PROFESSOR our.PossCL Christos Tsolakis’.

In Serbian, pronoun clitics can sometimes be used to express possession, as
in KOMŠIJA mi Asan ‘NEIGHBOR I.CL Asan (my neighbor Asan)’. However, these
constructions are rarely used, being considered rather obsolete and non-standard,
and are therefore not included in the patterns.


7.4 Expression of semantic and grammatical dependencies 


Prepositions are used to express semantic and grammatical dependencies, such 
as affiliation, domain specification, location in English, Bulgarian, French, Greek, 
and Serbian, and possession in English, Bulgarian, and French. Semantic and 
grammatical dependencies can be signified by cases in Greek and Serbian. In 
English, possession may also be expressed by a clitic ’s (marking the genitive
determiner), and in Bulgarian by the derivational suffix of possessive adjectives. 


7.5 Word order - position of the trigger with respect to the proper 
noun 


In French, Greek and Serbian, word order permutations are common for personal
names, as the first name and surname(s) can change places: French: Nicolas
Sarkozy vs. Sarkozy Nicolas; Greek: Γεώργιος Κοκκινόπουλος ‘Georgios
Kokkinopoulos’ vs. Κοκκινόπουλος Γεώργιος ‘Kokkinopoulos Georgios’; Serbian:
Marko Vitas vs. Vitas Marko. In Serbian, a change of the order of the first
name and the surname(s) of male persons results in a change of the syntactic
properties, as in the former case both names inflect, while in the latter only the
first name inflects, e.g., in the genitive Marka Vitasa vs. Vitas Marka.

In all languages, the trigger can appear in pre- or post-nominal position: French:
le MINISTRE des Finances et des Comptes publics, Michel Sapin ‘the MINISTER of
finance and of public accounts, Michel Sapin’, or Michel Sapin, MINISTRE des Finances
et des Comptes publics ‘Michel Sapin, MINISTER of finance and of public
accounts’; Greek: ο ΥΠΟΥΡΓΟΣ Οικονομικών Ευκλείδης Τσακαλώτος ‘the MINISTER
of finance Efkleidis Tsakalotos’, or ο Ευκλείδης Τσακαλώτος, ΥΠΟΥΡΓΟΣ
Οικονομικών ‘the Efkleidis Tsakalotos, MINISTER of finance’. Some abbreviations
can appear only before or after the names, as in Serbian: JP “Srbijašume”
‘PC (Acronym for Public Company) Srbijašume’ but Takovo d.o.o. ‘Takovo (a
place name) d.o.o.’.

In all languages, a complex trigger phrase is often in apposition (when the trigger
appears as an apposition, it is always separated with a comma), e.g., English:
Chris, the new PROFESSOR of Agriculture and Forestry; French: le MINISTRE des Finances
et des Comptes publics, Michel Sapin ‘the MINISTER of finance and of public
accounts, Michel Sapin’, or Michel Sapin, MINISTRE des Finances et des Comptes
publics ‘Michel Sapin, MINISTER of finance and of public accounts’; Greek: ο
ΚΑΘΗΓΗΤΗΣ Γεώργιος Μπαμπινιώτης ‘the PROFESSOR Georgios Babiniotis’ or
ο Γεώργιος Μπαμπινιώτης, ΚΑΘΗΓΗΤΗΣ ‘the Georgios Babiniotis, PROFESSOR’.

In all languages, the head personal name can be specified by more than one
trigger in a preferred order of appearance: English: DIRECTOR General PROF.
Smith; Bulgarian: generalniyat DIREKTOR PROF. Smit ‘general-the DIRECTOR PROF.
Smith’; French: DIRECTEUR général PROF. Smith ‘DIRECTOR General PROF. Smith’;
Greek: ο Γενικός ΔΙΕΥΘΥΝΤΗΣ ΚΑΘΗΓΗΤΗΣ Σμιθ ‘the General DIRECTOR PROFESSOR
Smith’; Serbian: Generalni DIREKTOR PROF. Smit ‘General DIRECTOR PROF. Smith’.


7.6 Alternations 


In English, genitive determiners may alternate with a possessive prepositional 
phrase. 

The possessive PP in Bulgarian may alternate with a possessive or relational
adjective in pre-position (STOLICATA na Italiya Rim ‘CAPITAL-THE of Italy Rome’,
italianskata STOLICA Rim ‘Italian-the CAPITAL Rome’). Alternations of possessive
pronouns and possessive pronoun clitics in Bulgarian are also observed. A noun
modifier in pre-nominal position can alternate with a prepositional phrase (ski
INSTRUKTORĂT ‘ski INSTRUCTOR-THE’ vs. INSTRUKTORĂT po ski ‘INSTRUCTOR-THE at
ski’).

In Greek, the genitive phrase may alternate with the preposition σε ‘at’ followed
by accusative case, e.g., Γεώργιος Μπαμπινιώτης, ΚΑΘΗΓΗΤΗΣ του Πανεπιστημίου
Αθηνών ‘Georgios Babiniotis, PROFESSOR of the University of Athens’
or Γεώργιος Μπαμπινιώτης, ΚΑΘΗΓΗΤΗΣ στο Πανεπιστήμιο Αθηνών ‘Georgios
Babiniotis, PROFESSOR at the University of Athens’. In this case, the two structures
may convey a different meaning. The alternation is not possible for all
proper names, e.g., Αλέξης Τσίπρας, ΠΡΩΘΥΠΟΥΡΓΟΣ της Ελλάδας ‘Alexis Tsipras,
PRIME MINISTER of the Greece’, *Αλέξης Τσίπρας, ΠΡΩΘΥΠΟΥΡΓΟΣ στην
Ελλάδα ‘Alexis Tsipras, PRIME MINISTER in the Greece’. A location proper name
describing residency at a continent, a country or a city may alternate as an adjective
modifier or a PP complement attached to the trigger of a personal name (the
same is true for English, Bulgarian, and Serbian), e.g., Greek: ο ΠΡΩΘΥΠΟΥΡΓΟΣ
της Ελλάδας ‘the PRIME MINISTER of the Greece’ or ο Έλληνας ΠΡΩΘΥΠΟΥΡΓΟΣ
‘the Greek PRIME MINISTER’; Serbian: AMBASADA Grčke ‘EMBASSY of Greece’ vs.
Grčka AMBASADA ‘Greek EMBASSY’ (while in French, the adjective follows the trigger:
le PRÉSIDENT français François Mitterrand ‘the PRESIDENT of-French.Adj
François Mitterrand’).

In French, we may have an alternation of the preposition de ‘of’ with the preposition
à ‘at’, e.g., Martin Vetterli, PROFESSEUR de l'École polytechnique fédérale de
Lausanne ‘Martin Vetterli, PROFESSOR of the Federal Polytechnic School of Lausanne’
or Martin Vetterli, PROFESSEUR à l'École polytechnique fédérale de Lausanne
‘Martin Vetterli, PROFESSOR at the Federal Polytechnic School of Lausanne’.

In Serbian, syntactic alternations are permissible, to some extent, with organization
names: a complement in the genitive case instead of a PP complement
(MINISTARSTVO rada i socijalne politike ‘Ministry of Labor and Social Policy’
instead of MINISTARSTVO za rad i socijalnu politiku ‘Ministry for Labor and
Social Policy’).

The features of the triggers in the five languages are summarized in Table 10. 


8 Conclusion 


The semantic classification and the syntactic patterns of single and multiword
names in Bulgarian, English, French, Greek, and Serbian may provide reliable
data for rule-based Named Entity Recognition (NER).

Linguistic features and distribution facts are used to identify MWEs in NER
tasks, both in handcrafted rule-based systems that rely heavily on linguistic
knowledge, and in machine-learning techniques. In their research on the appli- 
cation of MWEs and NEs in keyphrase extraction, Nagy T. et al. (2011) conclude 
that previously known noun compounds are beneficial in NER, and that identi- 
fied NEs enhance MWE detection, as noun compounds and multiword NEs are 
linguistically similar and sometimes it is not easy to distinguish between the two. 

These arguments are further supported by the tagging practice where both 
compound nouns and multiword NEs are often tagged as nouns, as their lin- 
guistic behaviour is similar to that of single-word nouns (Vincze et al. 2011). Ap- 
proaches such as that of Nagy T. et al. (2011) also use features involving NEs or 
pertaining to NEs (i.e., orthography and semantics of keyphrase candidates; po- 
sitions of a token belonging to a specific NE class, as certain classes of NEs can 
be identified by their position in the beginning, in the middle or at the end of a
keyphrase candidate). Galicia-Haro et al. (2004) discuss the (Spanish) composite 
NEs (titles of books, movies, songs, etc.) that are described in terms of syntactic 
and semantic features and of local context and consider discourse features such 
as introductory words, prepositions, redundancy, specific sets of names, etc.

Rule-based systems usually rely on large-scale lexical resources and grammars,
often in the form of regular expressions or Finite State Transducers (Savary &
Piskorski 2011; Maurel et al. 2011). Much work has been done on rule-based NER
for the five languages discussed in this paper, although machine learning methods
prevail. A set of general NER rules with reasonable accuracy has been developed
for rule-based annotation of NEs in Bulgarian (Karagiozov et al. 2012),
French (Maurel et al. 2011), Greek (Farmakiotou et al. 2000), and Serbian (Krstev
et al. 2013). Vitas et al. (2007) discuss semantic and morphological (derivational
and inflectional) properties of proper names in Serbian (plus French and English),
taking into account the significance of regular derivation and the properties and
function of possessive and relational adjectives produced from proper names.
Koeva & Dimitrova (2015) discuss a strategy for a linguistic description and classification
of Bulgarian NEs referring to persons, and their application in several
resources (lexicons and an annotated corpus) for the definition and evaluation
of a set of NER rules.

Table 10: Comparison of the morphological and syntactic features of
the five languages (EN = English, BG = Bulgarian, FR = French, GR = Greek,
SR = Serbian).

Features that concern the trigger                           EN  BG  FR  GR  SR
adjective in pre-position                                   +   +   +   +   +
adjective in post-position                                  -   -   +   -   +
PP in post-position                                         +   +   +   +   +
genitive phrase in post-position                            -   -   -   +   +
genitive phrase in pre-position                             -   -   -   +   -
genitive determiner in pre-position                         +   -   -   -   -
definite article                                            *   -   -   *   -
definite morpheme                                           -   +   -   -
obligatory definiteness with modifier Adj/PP in a sentence  +   +   +   +
possessive pronoun clitic                                   -   +   -   +   -
dependencies expressed by prepositions                      +   +   +   +   +
dependencies expressed by cases                             -   -   -   +   +
analytically expressed dependencies                         +   +   +   +   +
genitive determiner and PP alternation                      +   -   -   -   -
genitive and PP alternation                                 -   -   -   +   +
poss. pronoun and possessive clitic alternation             -   +   -   -   -
noun modifier and PP alternation                            +   +   -   +   +

Features that concern the whole NE                          EN  BG  FR  GR  SR
pre-position of the trigger                                 +   +   +   +   +
apposition
interrogative clitic                                        -   +   -   -   -

The syntactic patterns presented in this paper are formulated as rules com- 
prising morphological characteristics and syntactic dependencies related to the 
semantic properties of personal, location and organization NEs in Bulgarian, En- 
glish, French, Greek, and Serbian. We intend to further exploit the formally en- 
coded linguistic information in rule-based NER approaches. Moreover, as the syn- 
tactic patterns for different languages are linked to the same semantic pattern, 
they can be considered equivalent at the conceptual level and may be applied to 
any task that involves multilingual processing: cross-lingual information extrac- 
tion and text classification, multilingual summarization and machine translation. 
Last but not least, the presented approach contributes to comparative language 
studies and may be further extended to other word classes that show relatively 
regular morphological properties and syntactic dependencies. 
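
As a rough illustration of how one such pattern might be turned into a rule, the sketch below encodes a single English pattern (definite article, optional adjective, trigger, optional PP, proper name) as a regular expression in Python. The trigger list, the group names and the helper `find_triggered_names` are assumptions made for this example only; they are not the rules or lexical resources described in this paper, and a realistic system would rely on large lexicons rather than a four-word list.

```python
import re

# Hypothetical mini-lexicon of triggers (illustration only).
TRIGGERS = ("professor", "minister", "poet", "author")

# A capitalised word; a name is one or more of these, with an optional initial.
PROPER = r"[A-Z][a-z]+"
NAME = PROPER + r"(?: [A-Z]\.)?(?: " + PROPER + r")*"

# the (ADJ) TRIGGER (of/from (the) X)(,) NAME
RULE = re.compile(
    r"\b[Tt]he "
    r"(?:(?P<adj>[A-Za-z]+) )?"
    r"(?P<trigger>" + "|".join(TRIGGERS) + r")"
    r"(?: (?P<pp>(?:of|from) (?:the )?" + PROPER + r"))?"
    r",? "
    r"(?P<name>" + NAME + r")"
)


def find_triggered_names(text):
    """Return (trigger, proper name) pairs matched by the toy rule."""
    return [(m.group("trigger"), m.group("name")) for m in RULE.finditer(text)]


examples = [
    "the Scottish poet Burns",
    "the poet from Kosovo, Fahredin Shehu",
    "the author of the Concerto, Edvard Grieg",
]
for sentence in examples:
    print(sentence, "->", find_triggered_names(sentence))
```

The same semantic pattern would need a separate surface rule per language (for instance, a definite morpheme on the trigger in Bulgarian instead of a preceding article), which is the kind of cross-lingual linking the syntactic patterns are meant to support.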


Abbreviations 

ADJ            adjective
ADV            adverb
AFF            affiliation
DEFART         definite article
DEFADJ         definite adjective
DEFPOSSPRON    definite possessive pronoun
DEFTRIGGER     definite trigger
(DOM)SPEC      (domain) specification
GENDET         genitive determiner
GENP           (noun) phrase in genitive
INDEFTRIGGER   indefinite trigger
INS            instrumental
LIT.           literal translation
LOC            location
MWE            multiword expression
MWLOCN         multiword location name
MWORGN         multiword organization name
NE             named entity
NER            named entity recognition
PERN           personal name
POSSADJ        possessive adjective
(POSS)CL       (possessive) pronoun clitic
POSSPRON       possessive pronoun
PP             prepositional phrase
QUCL           interrogative clitic


The work presented in this paper is aimed at studying predicates that pertain to
the semantic field of emotions, the focus being on Modern Greek verbal multiword
expressions (verbal MWEs) and their counterparts in French. A core lexicon of verbal
MWEs denoting emotion was extracted from existing Modern Greek lexical
resources; the initial list was further extended and revised manually in view of corpus
evidence. A classification of MWEs is proposed based on syntactic, selectional
and semantic properties; an attempt to map the expressions identified onto their
French counterparts was also made. The cross-linguistic study reveals similarities
and discrepancies in the two languages, and highlights the interaction between
MWE structure and the underlying semantics, in that the intensity of the emotion
denoted and the degree of fixedness of the relevant expressions seem to be
highly correlated in both languages.


1 Introduction 


The availability of user-generated content over the web and the increasing need 
to make the most out of it has brought about a shift of interest from factual in- 
formation to the identification of subjective information (as opposed to facts) 
expressed by people or groups of people with respect to a specific topic. To this 
end, the task of determining the so-called private states (that is, beliefs, feelings,
and speculations) expressed in running text and the entities involved has been
the focus of attention in the field of Natural Language Processing (NLP). There- 
fore, identification of expressions denoting emotion or emotional state in textual 
data and their classification is of paramount importance. In this respect, MWEs 
can hardly be overlooked since they constitute a significant proportion of the 
emotion lexicon. 

We hereby present work aimed at treating verbal multi-word predicates that 
pertain to the semantic field of emotions from a cross-lingual perspective and sys- 
tematising their lexical, syntactic and semantic properties. In this context, verbal 
MWEs in Modern Greek denoting emotion or emotional state were selected from
existing language resources. Their lexico-semantic properties were also retrieved 
from these resources and new entries were encoded following the same princi- 
ples. All MWEs were further assigned semantic features inherent to the semantic 
field. At the next stage, their mapping onto their counterparts in French was 
performed. The comparative study of Greek and French MWEs resulted in the 
identification of cross-lingual similarities and discrepancies. Moreover, correla- 
tions between lexical features and the underlying semantics of MWEs were also 
revealed. Our working hypothesis was that despite idiosyncrasies, MWEs that be- 
long to a given semantic class share features that are characteristic for this class; 
moreover, these field-specific features are attested cross-linguistically. One step 
further, the (cross-lingual) treatment of MWEs might be useful not only from a 
purely linguistic point of view but also for NLP applications. 

The paper is outlined as follows. An overview of background work on the study
of the emotion lexicon and of MWEs is presented in §2; §3 outlines the methodological
framework adopted, whereas the selection process of the lexical data is
described in §4. The lexicon of emotion MWEs and the syntactic, selectional and
semantic properties encoded are presented in §5; we discuss our findings in §6
and elaborate further on cross-lingual considerations in §7. Finally, our conclusions
and prospects for future research are outlined in §8.


2 Background work 


The seminal work at the syntax-semantics interface by Levin (1993) involves 
large-scale classification of English verbal predicates on the basis of shared mean- 
ing and syntactic properties. In this work, more than 3000 verbs were grouped 
into semantically coherent verb classes, each depicting a syntactic configura- 
tion that reflects verb meaning. A more fine-grained semantic classification of 
French verb and noun predicates denoting feeling, emotion and psychological
states has also been performed (Mathieu 1999; 2005), aimed at a wide range of
NLP applications. French nominal and verbal predicates denoting emotion and 
their lexicalised word combinations have been studied (Leeman 1991; Gross 1995; 
Balibar-Mrabti 1995; Tutin et al. 2006) from a different point of view. Finally, a 
comparative analysis of English and French single-word verbal predicates denot- 
ing emotion (Mathieu & Fellbaum 2010) reports on properties shared among the 
two languages on the grounds of syntax and semantics, unveiling at the same 
time the idiosyncrasies of each language. 

As far as MWEs are concerned, a systematic treatment of French fixed expres- 
sions has been carried out (Gross 1982). In this work, the classification and the 
analysis of c. 20000 French verbal MWEs consists of the formal representation 
of their syntactic properties, selectional restrictions and the distinction between 
fixed and non-fixed constituents. Along the same lines, the classification of Greek 
fixed expressions (c. 6000 entries) has been performed based on the same formal 
principles and criteria (Fotopoulou 1993b; Mini 2009). 

The present study is part of a larger effort aimed at developing lexical resources 
that encompass the Greek emotion lexicon, i.e., words and phrases that refer
to emotional states and emotion-related mental events. Previous work involves 
treatment of nouns and verbs. In this context, 130 Greek noun predicates denot- 
ing emotion (Nsent) were identified and classified on the basis of the verbs' syn- 
tactic, semantic and distributional properties (Pantazara et al. 2008; Fotopoulou 
et al. 2009). In this context, support verbs (Vsup) and other verbs expressing di- 
verse modalities (aspect, intensity, control, etc.) were identified and encoded as 
properties; these properties reveal the restrictions nouns impose on the lexical 
choice of verbs. Similarly, 339 Greek verbal predicates denoting emotion (Vsent) 
were classified into homogeneous syntactico-semantic classes based on their syntactic,
lexical and semantic properties (Giouli & Fotopoulou 2012); a number of
syntactic features (i.e., argument structure, alternations), selectional restrictions 
imposed on the verbs' subject and object complements, emotion type, polarity 
and intensity were also defined and encoded formally. 

In this respect, this work is further aimed at enriching the set of lexical re- 
sources pertaining to the semantic field of emotions with a lexicon that com- 
prises verbal MWEs denoting emotion or emotional state. Moreover, the Greek 
MWEs were mapped onto their French counterparts. The ultimate goal was not 
only to develop a bi-lingual lexical resource, but also to test the hypothesis that, 
despite the idiosyncrasies that are inherent to MWEs in general, a certain degree 
of regularity (in terms of inherent properties) can be observed within a semantic 
class. To this end, we opted for reusing and extending existing lexical resources 
that encompass verb MWEs in Greek and French.


3 Methodological framework


The resources that form the basis of the present study have been developed us- 
ing the Lexicon-Grammar (LG) methodological framework (Gross 1975). Being a 
model of syntax limited to the elementary sentences of a natural language, the 
theory argues that the unit of meaning is not located at the level of the word, 
but at the level of the sentence of the form Subject – Verb – Object. Therefore, the
elementary sentence is transformed to its predicate-argument structure, and the 
main complements (subject, one or more objects) are separated from other com- 
plements (adjuncts) on the basis of formal criteria. Distributional properties as- 
sociated with words, i.e., types of prepositions, semantic features inherent to 
nouns in subject and object positions, etc. are also taken into account, resulting 
in a more fine-grained classification and in the creation of homogeneous word 
classes. Finally, transformation rules, construed as equivalence relations between 
sentences, generate additional equivalent structures. All this information (argu- 
ment structure, distributional properties and permitted transformational rules) 
is formally encoded in the so-called LG tables. 

Each table is defined by a set of distinct properties (syntactic, distributional, 
and semantic) and includes all the lexical items sharing these properties. Predi- 
cates with more than one usage or meaning are treated as separate lexical items 
possibly represented in different tables, and the syntactic and semantic proper- 
ties are assigned to each entry as appropriate. In this sense, entries in one table 
are considered to form a homogeneous class. In an LG table, the set of properties
that describe the entries is encoded as column headers, whereas entries
are listed in separate rows. At the intersection of a row corresponding to a lexical
item (entry) and a column corresponding to a property, the cell is set to ‘+’ if the
property is valid for the given entry or ‘-’ if it is not.
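
The row-by-column layout just described can be pictured as a small Boolean matrix. The following Python sketch is only an illustration of that idea: the class name `LGTable`, the property names and the entries are all hypothetical and do not reproduce any actual LG table.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LGTable:
    """Toy LG table: one row per entry, one '+'/'-' cell per property."""
    properties: List[str]
    rows: Dict[str, Dict[str, bool]] = field(default_factory=dict)

    def add_entry(self, entry: str, marks: str) -> None:
        """Add a row from a string such as '+ -' (one mark per property)."""
        cells = marks.split()
        if len(cells) != len(self.properties):
            raise ValueError("exactly one mark per property is required")
        if any(c not in ("+", "-") for c in cells):
            raise ValueError("cells may only contain '+' or '-'")
        self.rows[entry] = {p: c == "+" for p, c in zip(self.properties, cells)}

    def cell(self, entry: str, prop: str) -> str:
        """Return '+' if the property is valid for the entry, else '-'."""
        return "+" if self.rows[entry][prop] else "-"

    def entries_with(self, prop: str) -> List[str]:
        """List the entries for which the property is valid."""
        return [e for e, cells in self.rows.items() if cells[prop]]


# Hypothetical properties and entries, for illustration only.
table = LGTable(properties=["N0_human", "passive"])
table.add_entry("entry_a", "+ -")
table.add_entry("entry_b", "- -")
table.add_entry("entry_c", "+ +")
print(table.entries_with("N0_human"))
```

Keeping a predicate with two usages as two separate keys in `rows` (or in two tables) mirrors the treatment of such predicates as separate lexical items.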

Similarly, MWEs are also treated as elementary sentences for which all pos- 
sible fixed and non-fixed (or variable) arguments (if any) are consistently and 
uniformly encoded. The formalism provides the mechanism for encoding prop- 
erties that are appropriate for the identification and processing of MWEs. More 
precisely, the MWE structure is represented as a Part-of-Speech sequence. According
to the LG notation, N denotes a non-fixed nominal, whereas C signifies
a fixed one; numbers are used to represent the syntactic function of fixed or non-fixed
constituents. In this sense, N0 is used to represent a non-fixed noun in subject
position, whereas C0 denotes a fixed subject. Similarly, N1, N2, N3, etc., along
with C1, C2, C3, etc., denote complements in object position (or complements of
prepositional phrases), marked also for fixedness. It should be noted, however,
that the internal structure of the noun phrase is not represented explicitly in
general; patterns depict the elementary sentence or structure characterising each 
MWE class, whereas information regarding modifiers, determiners, etc. allowed 
for by certain expressions is provided in the form of features or properties. Se- 
lectional restrictions over the non-fixed or variable elements of MWEs as well as 
syntactic phenomena (e.g., passive alternation, etc.) - if any - are also encoded 
formally. Finally, other grammatical phenomena such as agreement features are 
accounted for. 

For example, the MWE in (1) below comprises two fixed (or lexicalised) el- 
ements, a verb and a noun in subject position, and two variable elements, namely 
a nominal phrase in accusative and a possessive pronoun (Poss) that modifies the 
fixed nominal constituent. The variable nominal phrase is most often realised as
a weak personal pronoun in pre-verbal position (Ppv); agreement in number and 
person between the two variable elements is mandatory: 


(1) με πιάνουν τα διαόλια μου / *σου / *του Γιάννη
    me pianun ta ðiaolia mu / *su / *tu Γiani
    me catch.3PL the devils.NOM.PL.POSS my / your / the John.GEN
    lit. ‘my devils catch me’
    ‘to become very angry’


In this case, a generic syntactic pattern like the one depicted in (2) below is
used to describe a class in an LG table.


(2) a. Ppv V C0 Poss
    b. Ppv-1 V C0 Poss-1


The agreement attested between variable elements is then depicted via co- 
indexing as shown in (2b). 
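
To make the notation concrete, the sketch below parses a pattern string such as the one in (2b) into slots and checks that co-indexed slots carry the same person and number. It is a minimal sketch under our own assumptions: the `Slot` structure, the token grammar in `SLOT_RE` and the feature dictionaries are hypothetical and are not part of the LG formalism itself.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Slot(NamedTuple):
    category: str             # "Ppv", "V", "N", "C" or "Poss"
    position: Optional[int]   # syntactic position of N/C slots (0 = subject)
    index: Optional[int]      # co-indexing label, e.g. the 1 in "Ppv-1"
    fixed: bool               # True for lexicalised slots (here: V and C)


# Assumed token shape: category, optional position digit(s), optional "-index".
SLOT_RE = re.compile(r"^(Ppv|Poss|V|N|C)(\d+)?(?:-(\d+))?$")


def parse_pattern(pattern: str) -> List[Slot]:
    """Split a pattern like 'Ppv-1 V C0 Poss-1' into Slot records."""
    slots = []
    for token in pattern.split():
        match = SLOT_RE.match(token)
        if match is None:
            raise ValueError("unrecognised slot: " + token)
        category, position, index = match.groups()
        slots.append(Slot(category,
                          int(position) if position else None,
                          int(index) if index else None,
                          category in ("V", "C")))
    return slots


def coindexed_groups(slots: List[Slot]) -> Dict[int, List[int]]:
    """Map each co-indexing label to the positions of the slots that bear it."""
    groups: Dict[int, List[int]] = {}
    for i, slot in enumerate(slots):
        if slot.index is not None:
            groups.setdefault(slot.index, []).append(i)
    return groups


def agreement_ok(slots: List[Slot], features: Dict[int, dict]) -> bool:
    """True if all co-indexed slots share the same person and number."""
    for members in coindexed_groups(slots).values():
        seen = {(features[i]["person"], features[i]["number"]) for i in members}
        if len(seen) > 1:
            return False
    return True


slots = parse_pattern("Ppv-1 V C0 Poss-1")
first_sg = {"person": 1, "number": "sg"}
second_sg = {"person": 2, "number": "sg"}
print(agreement_ok(slots, {0: first_sg, 3: first_sg}))   # me ... mu
print(agreement_ok(slots, {0: first_sg, 3: second_sg}))  # me ... *su
```

With the features of example (1), the first call corresponds to the grammatical variant and the second to the starred one.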

An example of MWE representation within the LG framework is illustrated
in Table 1; the table comprises verbal MWEs with the underlying structure N0 V
Prep C1 (Fotopoulou 1993b).

It becomes evident, therefore, that the LG framework together with the re- 
quirement of substantial coverage leads to a uniform and consistent description 
of elementary sentences and the formal encoding of properties across languages 
in a comparable manner. In this respect, one of the main advantages of LG is 
that it allows comparisons between languages and facilitates the construction of 
cross-language resources.

Table 1: LG table of verbal MWEs (sample).

      V           Prep  Det     C1        Poss
-  +  ακτινοβολώ  από   E       ευτυχία   E            -  +  -  -
+  -  αφρίζω      από   (E+τη)  λύσσα     (E+Poss-0)   -  -  -
+  -  βράζω       σε    το      ζουμί     Poss-0
+  -  γελώ        με    την     καρδιά    Poss-0       +  +  -  -
+  -  έρχομαι     σε    τα      ?         Poss-0
+  -  έρχομαι     σε    τα      συγκαλά   Poss-0       -  -  +  -
+  -  κάθομαι     σε    τα      αυγά      Poss-0
+  -  κάθομαι     σε    τα      αγκάθια   E
+  -  κιτρινίζω   από   τον     φόβο      Poss-0       -  +  -  -
+  -  λύνομαι     σε    τα      γέλια     E

4 Data selection 


The initial list of Greek and French MWEs that pertain to the semantic field 
of emotions was manually compiled from data listed in existing LG tables for 
Greek (Fotopoulou 1993b; Mini 2009) and French (Gross 1982). The selection of 
the Greek MWEs was performed as a two-stage procedure: (a) manual identification
of candidate MWEs that pertain to the semantic field of emotion, and (b) validation
of these candidate MWEs for inclusion or deletion on the basis of formal
criteria besides intuitive judgments. The initial list of MWEs was further updated 
and extended, drawing on corpus evidence. More precisely, Greek MWEs were 
selected manually from a suite of specialised corpora (Giouli & Fotopoulou 2014) 
that were developed and annotated in view of guiding sentiment analysis. In this 
sense, our work is corpus-based and thus empirical rather than purely intuitive. 

Since the scope of the current work is limited to clear instances of emotion-denoting
predicates (i.e., verbal MWEs), a formal distinction between direct and
indirect affective expressions that correspond to emotion concepts was in order.
For this reason, a set of lexical semantic tests (lexical substitution, paraphrasing,
etc.) was adopted as a formal device guiding the selection of Greek verbal
emotion predicates. Therefore, a candidate MWE is selected for inclusion in the
lexicon if at least one of the following criteria is met:

Criterion 1: A candidate Emotion MWE is selected if it can be replaced by a
sequence that comprises one of the verbs feel or cause and a noun that denotes
emotion (Nsent), that is, if there exists an Nsent that is related to the concept
EMOTION via the IS-A relation, and the relation ‘MWE is semantically equivalent
to “feel/cause Nsent”’ is true. For example, the expression in (3) is semantically
equivalent to an expression of the form to feel EMOTION, where EMOTION is panic:


(3) με πιάνει πανικός
    me piani panikos
    me catches panic.NOM
    ‘to panic’

Criterion 2: A candidate Emotion MWE is selected if it can be replaced by a
verb predicate that denotes emotion (Vsent), that is, if there exists a Vsent
defined as a conceptualization of a FEEL-EMOTION or CAUSE-FEEL-EMOTION event and
the relation MWE is semantically equivalent to "Vsent" holds. For example, the
expression in (4) is semantically equivalent to the Vsent φοβάμαι (fovame) ‘to
be frightened’:


(4) πάγωσε το  αίμα      μου
    payose to  ema       mu
    froze  the blood.NOM my


    ‘I was terrified’


Criterion 3: A candidate Emotion MWE is selected if it can be replaced by the
verb to be and an adjective that denotes emotion (Asent), that is, if there
exists an Asent defined as conceptualizing an EXPERIENCER-EMOTION or
TRIGGER-EMOTION entity, and the relation MWE is semantically equivalent to "to
be Asent" holds. In example (5) below, the expression is semantically
equivalent to an expression of the form to be Asent, namely είμαι έκπληκτος
(ime ekpliktos) ‘to be surprised’:


(5) μένω με   το  στόμα ανοικτό
    meno me   to  stoma anikto
    stay with the mouth open


    ‘to be aghast’
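
The three criteria amount to a disjunctive test: a candidate enters the lexicon
as soon as one accepted paraphrase of any of the three kinds exists. The
following Python sketch records that decision only. The `Candidate` class, its
field names, and the paraphrase strings are hypothetical illustrations, and the
judgment that a paraphrase is semantically equivalent remains a manual,
linguistic step that the code does not perform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A candidate MWE plus the paraphrases an annotator accepted for it.

    All field names are illustrative assumptions, not part of the LG tables.
    """
    form: str
    nsent_paraphrase: Optional[str] = None  # Criterion 1: "feel/cause Nsent"
    vsent_paraphrase: Optional[str] = None  # Criterion 2: single emotion verb
    asent_paraphrase: Optional[str] = None  # Criterion 3: "to be Asent"


def is_selected(candidate: Candidate) -> bool:
    """Return True if at least one of the three criteria is met."""
    return any([
        candidate.nsent_paraphrase,
        candidate.vsent_paraphrase,
        candidate.asent_paraphrase,
    ])


# Example (3) satisfies Criterion 1 via the paraphrase "to feel panic".
example_3 = Candidate("me piani panikos", nsent_paraphrase="feel panic")
print(is_selected(example_3))
```

A candidate for which no paraphrase was accepted is simply left out, which
mirrors the "inclusion or deletion" validation step described above.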


Finally, the selection of French MWEs denoting emotion and their mapping
onto their Greek counterparts was performed manually. First, translations or
translational equivalents of the Greek MWEs were either provided by human
translators or extracted from standard mono- and bilingual lexicographic
resources, such as the Trésor de la Langue Française Informatisé¹ and
WordReference.com. In certain cases, translations were obtained using English
as a pivot language. These translations were checked against entries in
existing LG tables that define the typologies of French MWEs (Gross 1982). Once
an expression was spotted, it was selected and aligned to its Greek
counterpart(s).

The aforementioned process resulted in the identification of 607 Greek and
520 French MWEs that constitute the linguistic data of the current study. As one
might expect, the numbers show that there is no 1:1 correspondence between
Greek and French MWEs denoting emotion. In fact, the process of translating
the list of Greek MWEs into the target language showed that the transition from
one language to the other was not always straightforward. The outcome of this
procedure can be summed up as follows (see also §6.2):


• a Greek MWE is mapped onto a French MWE;
• more than one Greek MWE is mapped onto a single French MWE;
• a single Greek MWE corresponds to more than one French MWE;
• one or more Greek MWEs correspond to a single-word French verb rather
  than an MWE.
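
The four mapping outcomes can be read off an alignment table mechanically. The
sketch below is one possible way to label each aligned pair; the function name,
the label strings, and the placeholder identifiers (`gr1`, `fr1`, ...) are
assumptions made for illustration and do not reproduce the actual alignment
data of the study.

```python
from collections import defaultdict


def classify_alignments(pairs, single_word_targets=frozenset()):
    """Label each (greek, french) pair with its mapping type.

    pairs: iterable of (greek_mwe, french_target) tuples.
    single_word_targets: French targets that are single verbs, not MWEs.
    """
    pairs = list(pairs)
    gr_to_fr = defaultdict(set)
    fr_to_gr = defaultdict(set)
    for gr, fr in pairs:
        gr_to_fr[gr].add(fr)
        fr_to_gr[fr].add(gr)

    labels = {}
    for gr, fr in pairs:
        if fr in single_word_targets:
            labels[(gr, fr)] = "MWE-to-single-word"
        elif len(gr_to_fr[gr]) > 1:
            # One Greek MWE with several French MWEs (checked first, so a
            # pair that is also many-to-one is reported as one-to-many).
            labels[(gr, fr)] = "one-to-many"
        elif len(fr_to_gr[fr]) > 1:
            labels[(gr, fr)] = "many-to-one"
        else:
            labels[(gr, fr)] = "one-to-one"
    return labels


toy_pairs = [
    ("gr1", "fr1"),                 # one-to-one
    ("gr2", "fr2"), ("gr3", "fr2"), # many-to-one
    ("gr4", "fr3"), ("gr4", "fr4"), # one-to-many
    ("gr5", "verb1"),               # single-word French verb
]
for pair, label in classify_alignments(toy_pairs, {"verb1"}).items():
    print(pair, label)
```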


5 Description of the MWEs Emotion Lexicon 


Data encoding was performed after data selection. The challenge of represent- 
ing MWEs in lexical resources is to ensure that the variability along with ex- 
tra features required by the different types of MWEs can be captured efficiently 
(Calzolari et al. 2002; Copestake et al. 2002). To this end, features and properties 
that are appropriate for the robust computational treatment of MWEs were re- 
tained from existing LG tables where applicable. MWEs extracted from corpora 
were encoded from scratch. Syntactic information includes the argument struc- 
ture of the elementary sentence (by also depicting fixed and variable elements), 
modification information (if permitted), syntactic alternations, and selectional re- 
strictions imposed over the variable elements of the MWE (often in subject and 
object(s) position). Additionally, all MWEs were coupled with information about
their type in terms of compositionality, syntactic rigidity idiosyncrasies, and
lexical choice. Moreover, semantic features that are relevant to the semantic
field to which each of these predicates belongs are also encoded, namely:
emotion type, polarity and intensity. In this way, the typologies of emotion
MWEs in Greek and French were consolidated and cross-lingual analogies or
discrepancies were identified. In the remainder, we will elaborate further on
the encoding of verb MWEs. As we have already mentioned above, linguistic
information is encoded formally in both the Greek and French tables, and this
common representation facilitates the extraction of shared patterns, if any.


¹ The resources are available online (http://atilf.atilf.fr/tlf.htm;
http://www.wordreference.com/).


5.1 Emotion MWEs: fixed expressions — SVCs 


In this section, we present the classification of verbal MWEs included in the emo- 
tion lexicon. Entries were assigned a value corresponding to the type they belong 
to, namely (a) fixed (or idiomatic) expressions and (b) support (or light) verb con- 
structions (SVCs). 

The identification of fixed expressions involves lexical, morphosyntactic and
semantic criteria (Gross 1982; 1998b; Lamiroy 2003), namely:
non-compositionality,² i.e., the meaning of the expression cannot be computed
from the meanings of its constituents; non-substitutability, i.e., at least one
of the expression's constituents does not enter into alternations on the
paradigmatic axis; and non-modifiability, in that they enter into syntactically
rigid structures, posing further constraints on modification, transformations,
etc. To this end, linguistic tests were applied to all MWEs. The examples that
follow conform to the criteria mentioned and are classified as fixed
expressions:


(6) δαγκώνω τη  λαμαρίνα
    ðagono  ti  lamarina
    bite    the panel.ACC


    ‘to be in love’


(7) serrer les dents 
to.clench the teeth 


'to grit one's teeth/to be stressed or angry' 


² We distinguish between composability/decomposability (Nunberg et al. 1994: 496)
and compositionality/non-compositionality. Composability concerns the property
of phrase elements to "[c]arry identifiable parts of the idiomatic meaning".

On the other hand, identification of SVCs for inclusion in the emotion lexicon
is based on the following criteria:

SVCs Criterion 1: SVCs comprise a support verb (Vsup) and a predicative noun 
denoting emotion (Nsent); support or light verbs of this type bear no meaning and 
are simply carriers of tense and person; 

SVCs Criterion 2: SVCs comprise specific (modal) verbs expressing diverse mod- 
alities (aspect, intensity, control, etc.) and an Nsent. These verbs are considered 
as Vsup variants. 

In this respect, SVCs are, to some extent, characterised by semantic
transparency due to the fact that the predicative noun, which carries the
predicative function within the SVC, is used in one of its literal senses. Basic
support verbs are έχω (exo)/avoir ‘to have’, είμαι Prep (ime Prep)/être Prep
‘be Prep’, κάνω (kano)/faire ‘to make’, the operator verb δίνω (ðino)/donner
‘to give’, and the causative verbs προκαλώ (prokalo)/défier, provoquer ‘to
cause’, προξενώ (prokseno)/provoquer ‘to cause’, αφήνω (afino)/laisser ‘to
leave’, which have an effect on structures with the basic Vsup. In practice,
however, SVCs are highly idiosyncratic and for this reason, it is quite
difficult to predict which Vsup combines with a noun (Abeillé 1988). In the case
of emotion MWEs, a close inspection of the data showed that domain-specific
verbs assume the function of a basic Vsup. Greek SVCs in this semantic field
usually select for the verbs νιώθω (nioθo) ‘to feel’ or αισθάνομαι (esθanome)
‘to feel’ (see (8)); similarly, their French counterparts select for the verbs
éprouver ‘to feel’ and ressentir ‘to feel’, as shown in example (9) below.
These constructions are semantically equivalent to single-word verb predicates
denoting emotion.


(8) νιώθω χαρά
    nioθo xara
    feel  joy.ACC


    ‘to feel joy’


(9) ressentir de la  joie
    to.feel   of the joy


    ‘to feel joy’


Additionally, certain verbs selected by the Nsent predicates that function as 
Vsup variants may further denote the degree or intensity of the emotion. From 
a cross-linguistic perspective, these Vsup variants usually form a pair of
translational equivalents in Greek and French, as shown in examples (10) and
(11) respectively:


(10) πετάω από  χαρά / τη  χαρά μου
     petao apo  xara / ti  xara mu
     fly   from joy  / the joy  my

     ‘to be very happy’


(11) sauter  de joie
     to.jump of joy

     ‘to be very happy’


Classification of MWEs as fixed expressions or SVCs is not always
straightforward or clear-cut, as shown in §5.2.2 and §6.1. In fact, some
expressions seem to constitute an intermediate class placed in between fixed
expressions and SVCs. In other words, there seems to be a continuum between
fixed expressions and SVCs (or between fixed and free expressions in other
cases). These expressions may be considered (under syntactic and semantic
conditions) as semi-fixed. A study of these expressions related to the degree
of fixedness is currently in progress (Constant & Fotopoulou 2016).


5.2 Syntactic properties 


Syntactic (and semantic) information is extracted from the LG tables for those 
MWEs that were accounted for in the past; new MWEs selected for the purposes
of the current study were encoded as appropriate. Syntactic information in the 
LG tables comprises the argument structure of each MWE, the syntactic alterna- 
tions defined for the particular MWE, and selectional restrictions imposed over 
the variable elements of the expressions. The encoding of modifiability specifi- 
cally concerns the fixed modifiers of SVCs. In the next sections, we elaborate on 
these aspects. 


5.2.1 Argument Structure 


Verbal MWE expressions (fixed non-compositional and SVCs) that denote an 
emotion bear no syntactic idiomaticity, since they generally conform to the ar- 
gument structure of the main verb and there is nothing exceptional in their syn- 
tactic behavior. This information is only implicitly encoded in the LG tables. In 
this respect, naming conventions of the initial tables correspond to specific con- 
figurations cross-linguistically, and this information can be easily and effectively 
retained in the current lexical resource. Information with respect to the
underlying structure and the syntactic function of the (fixed and variable)
constituent(s) further shows that verbal MWE predicates conform to the following
patterns: (i) fixed subject MWEs, (ii) fixed complement MWEs, and (iii) any
combination of the above. These types are presented in detail in the following
paragraphs.

Fixed Subject MWEs comprise a verb and an NP in subject position; these are
both lexicalised. Complements (if any) are represented as variable elements.
According to the LG notation, the generic syntactic pattern that describes MWEs
of this type is C0 V Q. The symbol Q³ is used to denote one or more complements
a predicate subcategorises for, without further specifying their form. In the LG
tables, however, the form and function of variable elements are further encoded.
For example, the patterns C0 V N1 and C0 V Prep C1 N2gen, used to describe the
Greek and French expressions in (12) and (13) below, further license a variable
nominal phrase in object position or as the complement of a PP modifier
respectively:


(12) cold sweat bathes me ‘I am terrified’
     Κρύος ιδρώτας       έλουσε την Άννα.
     krios iðrotas       eluse  tin Ana
     cold  sweat.NOM.SBJ bathed the Anna.ACC.OBJ


     ‘Anna was terrified.’


(13) La  haine    niche dans le  coeur de Anna.
     the hate.SBJ nests in   the heart of Anna


     ‘Anna hates.’


It should be noted, however, that the variable complement is usually employed 
in its cliticised form as shown in (14); this property is also encoded in the LG 
tables. 


(14) cold sweat bathes me ‘I am terrified’
     Την     έλουσε κρύος ιδρώτας.
     tin     eluse  krios iðrotas
     her.OBJ bathed cold  sweat.SBJ


     ‘She was terrified.’


Similarly, Greek SVCs may comprise an aspectual variant of a Vsup and a pred- 
icative noun denoting an emotion in subject position: 


(15) με πιάνει  πανικός
     me piani   panikos
     me catches panic.NOM.SBJ


     ‘to panic’


³ We will not discuss the possible forms assumed by Q in detail.


Fixed Complement MWEs. Verbal MWEs of this type comprise a verb and one
lexicalised complement. Most often, this lexicalised complement is an NP in
direct object position. The subject is represented as a variable argument of
the elementary sentence; the generic syntactic pattern that describes fixed
verbal MWEs of this type is N0 V C1, whereas the syntactic pattern of SVCs is
N0 Vsup Nsent:


(16) δαγκώνω τη  λαμαρίνα
     ðagono  ti  lamarina
     bite    the panel.OBJ


     ‘to be in love’


(17) avoir   du grief


Fixed PP Complement MWEs comprise a verb and a lexicalised prepositional
phrase (PP) complement. The variable NP in subject position along with other
non-fixed elements (if any) is also represented as appropriate. The generic
pattern that describes this class is of the form N0 V Prep C1. In (18), the
Greek MWE consists of the verb κάθομαι (kaθome) ‘to sit’ and the lexicalised PP
στα καρφιά (sta karfia) ‘on the nails’. Similarly, the French MWE in (19)
consists of the verb rire ‘to laugh’ and the PP aux larmes ‘to tears’:


(18) κάθομαι στα    καρφιά
     kaθome  sta    karfia
     sit     to.the nails


     ‘to be anxious, to be on tenterhooks’


(19) rire     aux    larmes
     to.laugh to.the tears


     ‘to roar with laughter’


Fixed Adjunct MWEs comprise a verb plus an adjunct (often an adverb) that are
both lexicalised. Other variable complements are depicted in the structure of
the relevant elementary sentence:


(20) φέρω  βαρέως
     fero  vareos
     carry heavily


     ‘to be very sad’


(21) Ils  s'   aiment comme deux tourtereaux
     they REFL love   like  two  lovebirds


     ‘They are in love.’


Finally, a number of verbal MWEs have a syntactic structure that is a combi- 
nation of the configurations presented. These structures are exhaustively repre- 
sented in the resource: 


(22) μου    ανεβαίνει το  αίμα      στο    κεφάλι
     mu     aneveni   to  ema       sto    kiefali
     me.GEN rises     the blood.NOM to.the head


     ‘to become very angry’


(23) la  moutarde monte au     nez
     the mustard  rises to.the nose


     ‘to become very angry’


(24) avoir   froid dans le  dos
     to.have cold  in   the back


     ‘to be terrified’
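
The patterns quoted in this subsection (C0 V N1, N0 V C1, N0 V Prep C1,
C0 V Prep C1 N2gen, N0 Vsup Nsent) follow a regular naming scheme in which
C-slots are lexicalised and N-slots are variable. The sketch below shows how
such a pattern string could be split into slots. It assumes only the shapes
attested in the text above; the function names and the dictionary layout are
hypothetical and are not part of the LG tables.

```python
import re

# A slot is C or N, a one-digit position index, and an optional lowercase
# case suffix such as "gen". Anything else (V, Vsup, Prep, Nsent, Q) is
# kept as an unanalysed token.
SLOT = re.compile(r"^(?P<kind>[CN])(?P<index>\d)(?P<case>[a-z]*)$")


def parse_lg_pattern(pattern):
    """Split an LG pattern string into a list of slot descriptions."""
    slots = []
    for token in pattern.split():
        match = SLOT.match(token)
        if match:
            slots.append({
                "token": token,
                "role": "fixed" if match["kind"] == "C" else "variable",
                "index": int(match["index"]),
                "case": match["case"] or None,
            })
        else:
            slots.append({"token": token, "role": "other"})
    return slots


def fixed_positions(pattern):
    """Return the indices of the lexicalised (C) slots of a pattern."""
    return [s["index"] for s in parse_lg_pattern(pattern) if s["role"] == "fixed"]


for p in ["C0 V N1", "N0 V C1", "N0 V Prep C1", "C0 V Prep C1 N2gen"]:
    print(p, "-> fixed slots:", fixed_positions(p))
```

A pattern with C0 only corresponds to a fixed subject MWE, one with C1 only to
a fixed complement MWE, and one with both to a combination of the two.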


5.2.2 Modification 


Fixed non-compositional verbal expressions do not allow for any modification over 
the fixed constituents. On the contrary, SVCs are considered as syntactically 
more flexible constructions, and adjectival modification is allowed over the Nsent. 
However, constructions with a Vsup do not conform to a uniform pattern of mod- 
ification (Moustaki et al. 2008). Adjectival modification within the MWE is found 
to be free, semi-fixed or even fixed. Modification in both languages involves
intensifiers or, more generally, grade indicators like μεγάλος (meyalos)/grand
‘big’, λίγος (liyos)/petit ‘few’, φοβερός (foveros)/intense ‘awful’, άκρατος
(akratos)/intense ‘awful’, etc.:


(25) Ο   Γιάννης νιώθει ένα παθολογικό   / υπαρξιακό   / αόριστο / *δυνατό άγχος.
     o   Yianis  nioθi  ena paθoloyiko   / iparksiako  / aoristo / *ðinato anxos
     the John    feels  a   pathological / existential / vague   / strong  anxiety


     ‘John feels a pathological / existential / vague / *strong anxiety.’


(26) Jean éprouve une angoisse pathologique / vague / sourde / mortelle / de
     John feels   an  anxiety  pathological / vague / silent / deadly   / of
     mort  / existentielle.
     death / existential


     ‘John feels a pathological / vague / silent / deadly / existential anxiety.’


(27) Με έπιασε μαύρη απελπισία   / *λύπη.
     me epiase mavri apelpisia   / *lipi
     me caught black despair.NOM / sorrow.NOM


     ‘I was in total despair.’


(28) J'  ai   eu  une peur bleue / *tristesse bleue.
     I   have had a   fear blue  / sadness    blue


     ‘I was terrified.’


The fixed modifiers, i.e., modifiers that seem to be idiosyncratic to a given
Nsent, cannot be employed productively. We note that in example (27), the
adjective μαύρη (mavri) ‘black’ is only used as a modifier of the nominal
predicate απελπισία (apelpisia) ‘despair’, which cannot be described literally
as being of black colour. Similarly, the French adjective bleu ‘blue’ in (28)
is only used with the nominal predicate peur ‘fear’. These expressions are also
encoded as fixed in the LG tables. Actually, this is evidence of the existence
of grey zones between SVCs and fixed expressions (cf. §5.1).

To conclude, Greek and French Nsent predicates in an SVC select from a variety
of modifiers in an idiosyncratic manner. Moreover, the respective Greek and
French expressions seem to present a variable degree of fixedness depending on 
the Nsent and the modifier selected. Free and semi-fixed modifiers are not en- 
coded in the lexicon so far. On the contrary, fixed modifiers of the predicative 
noun are encoded as fixed elements of the expression. 


5.2.3 Syntactic alternations 


Information relative to syntactic alternations encoded in the LG tables was also
kept in the lexical resource. The causative-inchoative alternation is a
syntactic property that involves verbs (or pairs of verbs) which have an
intransitive and a transitive usage. The inchoative form (intransitive) denotes
a change of state, and the causative form (transitive) denotes a bringing about
of a change of state. A number of emotive MWEs were found to enter this
alternation. The following cases have been attested in the LG tables:

First case: a pair of two MWEs each one comprising a distinct verb, whereas 
all the other fixed elements are identical. The two verbs (which are often predi- 
cates denoting movement) normally enter (or signal) the transitive-intransitive 
alternation: 


(29) to take one out of one's clothes ‘to make someone angry’
     ο   Γιάννης  την     βγάζει    τη  Μαρία     από  τα  ρούχα   της. (CAUS)
     o   Yianis   tin     vyazi     ti  Maria     apo  ta  ruxa    tis
     the John.SBJ her.OBJ takes.out the Maria.OBJ from the clothes hers


     ‘John makes Maria very angry.’


(30) to get out of one's clothes ‘to be made angry’
     η   Μαρία     βγήκε    από  τα  ρούχα   της. (INCHO)
     i   Maria     vyikie   apo  ta  ruxa    tis
     the Maria.SBJ went.out from the clothes hers


     ‘Maria was made very angry.’


(31) to send someone to the seventh sky ‘to make someone happy’
     Eric envoie Léa au     septième ciel. (CAUS)
     Eric sends  Lea to.the seventh  sky


     ‘Eric makes Lea very happy.’


(32) to go up to the seventh sky ‘to be happy’
     Léa monte   au     septième ciel. (INCHO)
     Lea goes-up to.the seventh  sky


     ‘Lea is in the seventh heaven.’


Second case: MWEs that comprise a verb that enters the transitive-intransitive 
alternation (ergativity): 


(33) to turn someone's lights on ‘to make someone angry’
     ο   Γιάννης  μου   άναψε     τα  λαμπάκια. (CAUS)
     o   Yianis   mu    anapse    ta  labakia
     the John.SBJ I.GEN turned.on the lights.OBJ


     ‘John made me very angry.’


(34) my lights turn on ‘I get angry’
     Μου   άναψαν    τα  λαμπάκια. (INCHO)
     mu    anapsan   ta  labakia
     I.GEN turned.on the lights.SBJ


     ‘I got very angry.’


Similarly, other syntactic properties were encoded in the LG tables where ap- 
plicable (i.e., passivisation, genitive-dative alternation, etc.). 


5.2.4 Selectional restrictions 


A number of selectional restrictions that are imposed on the variable elements 
of the MWEs (in subject and object(s) position) were encoded as properties in 
the LG tables. Like their single word counterparts, verbal MWEs denoting emo- 
tion select a nominal element that is obligatorily [+human]. Being at the heart 
of the syntax-semantics interface, this information relates to the participants of 
the emotion event. An emotion event generally involves an EXPERIENCER (that 
is, the individual experiencing the psychological state) and a THEME (that is, the 
content or object of the psychological state) or, occasionally, a CAUSE. These
participants, however, are not realised in a uniform way in single word verbal 
predicates. In this respect, the distinction between SubjectExperiencer (SubjExp) 
and ObjectExperiencer (ObjExp) single word verbal predicates has been estab- 
lished (Belletti & Rizzi 1988) based on the syntactic distribution of the verbal ar- 
guments and the associated Semantic Roles. The former project the EXPERIENCER 
of the emotion as their structural subject and the THEME or the STIMULUS as their 
structural object; the latter realise the THEME or the STIMULUS as the subject and 
the EXPERIENCER as their object. This information is of relevance to a number of 
NLP applications, and although it has not been encoded in the LG tables, it can 
be deduced easily. In fact, as has been shown (Giouli & Fotopoulou 2014) for
the single-word verbal predicates denoting emotion, the N0 or N1 complements 
with the [+human] restriction can be mapped onto the EXPERIENCER participant 
in the emotion event. 

This is true for MWEs too; here the EXPERIENCER may be realised either as a
structural subject or in object position. In this sense, the non-fixed element
that bears the semantic restriction [+human] corresponds unambiguously to the
EXPERIENCER of the emotion. In the following examples, the EXPERIENCER of the
emotion is expressed by the subject of the Greek and French expressions as
shown in (35) and (36) respectively, or by the direct object as depicted in
(37) and (38) below:


(35) Η   Άννα         πετάει από χαρά.
     i   Ana          petai  apo xara
     the Anna.SBJ.EXP flies  of  joy

     ‘Anna is very happy.’


(36) Anna         rayonne de joie.
     Anna.SBJ.EXP shines  of joy

     ‘Anna is very happy.’


(37) to take one out of one's clothes ‘to make someone very angry’
     ο   Γιάννης  με         έβγαλε   από τα  ρούχα   μου.
     o   Yianis   me         evyale   apo ta  ruxa    mu
     the John.SBJ me.OBJ.EXP took.out of  the clothes mine


     ‘John made me very angry.’


(38) Ce   film     m'         a   ému     aux larmes.
     this film.SBJ me.OBJ.EXP has touched in  tears


     ‘This film moved me to tears.’


Additionally, other selectional restrictions imposed on the variable elements of 
the verbal MWE are encoded. These restrictions further specify the type of com- 
plements (nominal, prepositional, sentential) that these predicates sub-categorise 
for. In this respect, prepositions selected by the MWE predicates are formally de- 
picted and encoded. 
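
The deduction described in this subsection, namely that the variable slot
carrying the [+human] restriction is the EXPERIENCER, can be sketched as a
small lookup. The function name and the dictionary layout are hypothetical;
the sketch also assumes that a table row lists, for each variable slot, the
set of restrictions imposed on it. When more than one variable slot is marked
[+human], the sketch declines to decide rather than guess.

```python
def experiencer_slot(restrictions):
    """Return the variable slot (e.g. "N0" or "N1") restricted to [+human].

    restrictions: dict mapping slot names to sets of restriction labels.
    Returns None when no slot, or more than one slot, is marked "+human".
    """
    human_slots = [slot for slot, feats in restrictions.items() if "+human" in feats]
    return human_slots[0] if len(human_slots) == 1 else None


# Subject experiencer, as in (35)/(36): only N0 is restricted to [+human].
print(experiencer_slot({"N0": {"+human"}}))

# Object experiencer, as in (38): the subject is unrestricted.
print(experiencer_slot({"N0": set(), "N1": {"+human"}}))
```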


5.3 Semantic classification 


The semantic classification of the studied Greek and French MWEs was aimed at
grouping them under pre-defined emotional concepts and at distinguishing
semantically between expressions that are near synonyms. This was attempted
following a schema defined for single-word Greek verbs denoting emotion (Giouli
& Fotopoulou 2012) along four dimensions: (a) emotion type, (b) emotion
polarity, (c) emotion intensity, and (d) aspect of the emotion event. The
semantic classification of verbal MWE predicates was performed separately by
two experienced linguists in the form of primarily intuitive semantic grouping.
At the next stage, discrepancies between the annotations thus obtained were
discussed and resolved, whereas cases for which no agreement could be reached
were left aside for future treatment. The outcome of this procedure was the
definition of specifications that would be applicable for distinguishing
between semantic classes.

Emotion is described as a set of two or more dimensions. The most common
ones are polarity, i.e., the positive or negative connotation of the emotion,
and the intensity or strength of the emotion. The notion of semantic polarity,
or the semantic orientation of words (whether they denote a positive or a
negative emotion), has also been the focus of attention in many studies aimed at
sentiment analysis (Esuli & Sebastiani 2006; Wilson et al. 2005, inter alia). In
our approach, the encoding schema provides for the annotation of the a priori
polarity of the emotion denoted, which takes one of the following values: (a)
positive, i.e., predicates which express a pleasant feeling; (b) negative, i.e.,
predicates which express an unpleasant feeling; (c) neutral, i.e., predicates
that denote an emotion that is neither positive nor negative; and (d) ambiguous,
i.e., predicates expressing a feeling, the polarity of which is
context-dependent (e.g., surprise).

Polarity identification results in a coarse - yet quite effective — classification of 
emotion expressions; a more fine-grained one was attempted on the basis of emo- 
tion types. Psychological considerations of sentiment claim that some emotions 
are more basic than others, therefore, they should be universal to all human lan- 
guages. The identification of basic emotions is based upon specific functional and 
physiological criteria, yet languages are claimed to possess inventories that com- 
prise a great number of emotion predicates that cannot be easily accommodated 
within such fairly straightforward schemes. To this end, different dimensions of 
emotion can be used to delineate senses. In the work presented here we adopted 
an extended version of the typological model defined by Plutchik (2001). The ini- 
tial model comprises eight basic emotions: anger, fear, sadness, disgust, surprise, 
anticipation, acceptance and joy. On the basis of corpus evidence derived from a 
tri-lingual corpus (English, Greek, Spanish) annotated for sentiment (Giouli et al. 
2013), the initial list of basic emotions was further extended with a set of com- 
plex emotions, such as love and hate or emotions of (self-)appraisal (e.g., shame, 
respect) that were not considered by Plutchik. To better account for the concep- 
tual representation of the emotion vocabulary, the final set of emotion types in- 
cludes 15 new classes, namely: admiration, boredom, disappointment, envy, grat- 
itude, hate, indifference, jealousy, love, relaxedness, remorse, resentment, respect, 
and shame. Greek and French MWEs were assigned an emotion concept; this 
classification results in grouping Greek and French verbal MWEs under emotion 
concepts. 

Moreover, to model the semantic distinction between near synonyms that occur
within a semantic class such as φοβάμαι (fovame) ‘to be scared’,
πανικοβάλλομαι (panikovalome) ‘to panic’, μου κόπηκαν τα ήπατα (mu kopikan ta
ipata) ‘to be very frightened’, etc., entries were further coupled with the
feature intensity (or strength). The following values are provided for by the
schema for the feature strength: low, medium, high, and uncertain. In fact,
emotion verbal predicates have been shown to possess scalar qualities (Fellbaum
& Mathieu 2012). In this respect, groups of verbs that were assigned the same
emotion type were checked in order to identify different degrees of intensity
of the same underlying emotion. To this end, intuitive judgments of trained
lexicographers were systematised and a number of linguistic tests were defined,
aimed at the consistent annotation and the ordering of predicates according to
the intensity of the emotion they denote.

In both languages, intensity proved to be dependent on the following aspects:
(a) degree of fixedness, (b) modifier selected (in SVCs), and (c) the Vsup
selected. More precisely, the majority of verbal idioms were judged to express
an emotional state or event of high intensity; these were further marked as not
accepting any modifier. Similarly, the Vsup of an SVC seemed to have an impact
on the value assigned to the feature intensity. Ultimately, a number of Vsups
function as intensifiers of the emotion denoted. In this respect, the verbs έχω
(exo)/avoir ‘to have’, νιώθω (nioθo)/éprouver ‘to feel’ and αισθάνομαι
(esθanome)/ressentir ‘to feel’ in Greek and French respectively usually denote
an emotion that bears the value medium for the feature intensity; on the
contrary, when the verbs πετάω (petao) ‘fly’ and rayonner ‘shine’ are employed
instead, the entire expression is marked as denoting the same emotion, yet with
an intensity marked as high. Modification of the Greek and French expressions
is permitted only when the Vsup that evokes a medium intensity of an emotion is
employed, as shown in (39) and (41); when the Vsup denoting an emotional state
of high intensity is employed, modification is blocked, as in (40) and (42):


(39) Η   Άννα νιώθει χαρά / μεγάλη χαρά.
     i   Ana  nioθi  xara / meyali xara
     the Anna feels  joy  / big    joy


     ‘Anna is happy / very happy.’

(40) Η   Άννα πετάει από χαρά / *μεγάλη χαρά.
     i   Ana  petai  apo xara / *meyali xara
     the Anna flies  of  joy  / big     joy

     ‘Anna is very happy.’


(41) Anna éprouve de la  joie / une grande joie.
     Anna feels   of the joy  / a   big    joy


     ‘Anna is happy / very happy.’


(42) Anna rayonne de joie / *rayonne d' une grande joie.
     Anna shines  of joy  / shines   of a   big    joy


     ‘Anna is very happy.’
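
The dependency between the Vsup, the intensity value, and the availability of
modification illustrated in (39)-(42) can be summarised as a lookup. The table
below contains only the verbs named in the text; the dictionary, the function
names, and the fallback to the schema value uncertain for unlisted verbs are
assumptions of this sketch. Since the text says these verbs "usually" carry the
stated value, the mapping is a default rather than a rule.

```python
# Default intensity associated with a support verb (Greek and French).
VSUP_INTENSITY = {
    "έχω": "medium", "νιώθω": "medium", "αισθάνομαι": "medium",
    "avoir": "medium", "éprouver": "medium", "ressentir": "medium",
    "πετάω": "high", "rayonner": "high",
}


def svc_intensity(vsup):
    """Return the default intensity for an SVC built on the given Vsup."""
    return VSUP_INTENSITY.get(vsup, "uncertain")


def allows_modifier(vsup):
    """Modification of the Nsent is permitted only with medium-intensity Vsups."""
    return svc_intensity(vsup) == "medium"


for verb in ["éprouver", "rayonner", "νιώθω", "πετάω"]:
    print(verb, svc_intensity(verb), allows_modifier(verb))
```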


Finally, the encoding schema also provides values for the feature aspect, i.e., 
the perspective taken on the internal temporal organization of the emotion event. 
Different values of aspect distinguish different ways of viewing the internal tem- 
poral constituency of the same event. The schema adopted provides the values in- 
choativeAspect, terminativeAspect, durativeAspect and frequentiveAspect. The en- 
coding at this level, however, has been finalised only for the Greek MWEs. 
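
Taken together, the features described in this section suggest a record
structure along the following lines. This is a sketch only: the class and
field names are hypothetical, and the actual resource is encoded in LG tables
rather than in Python objects. The value sets for polarity, intensity, and
aspect are those listed in the text; the sample entry uses example (8), with
the medium intensity that the text associates with νιώθω and with joy treated
as a positive emotion for the sake of illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MweType(Enum):
    FIXED = "fixed"
    SVC = "SVC"


class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"
    AMBIGUOUS = "ambiguous"


class Intensity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    UNCERTAIN = "uncertain"


class Aspect(Enum):
    INCHOATIVE = "inchoativeAspect"
    TERMINATIVE = "terminativeAspect"
    DURATIVE = "durativeAspect"
    FREQUENTIVE = "frequentiveAspect"


@dataclass(frozen=True)
class EmotionMwe:
    form: str                 # citation form of the expression
    language: str             # "el" or "fr" (codes are an assumption)
    lg_pattern: str           # e.g. "N0 Vsup Nsent"
    mwe_type: MweType
    emotion_type: str         # e.g. "joy", "fear"
    polarity: Polarity
    intensity: Intensity
    aspect: Optional[Aspect] = None  # left empty when not (yet) encoded


entry = EmotionMwe(
    form="νιώθω χαρά",
    language="el",
    lg_pattern="N0 Vsup Nsent",
    mwe_type=MweType.SVC,
    emotion_type="joy",
    polarity=Polarity.POSITIVE,
    intensity=Intensity.MEDIUM,
)
print(entry)
```

Keeping aspect optional reflects the statement above that this level has been
finalised only for the Greek MWEs.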


6 Discussion 


At the final stage of our study, an examination of the interplay between
syntactic, semantic and lexical features of the studied MWEs was performed.
Moreover, cross-lingual similarities and differences were identified. As has
already been mentioned, our working hypothesis was that despite idiosyncrasies,
MWEs that pertain to a given semantic class share features that are
characteristic of this class; moreover, these features can even be attested
cross-linguistically. As mentioned in §5.1 above, MWE identification and
classification employs lexical and morphosyntactic besides semantic criteria
(Gross 1982; 1998a; Lamiroy 2003). However, they do not apply in all cases in a
uniform way, and the variability attested brings about the notion of degree of
fixedness (Gross 1996). On the one hand, fixed expressions bear a meaning that
cannot be computed based on the meaning of their constituents and the rules
used to combine them. SVCs, on the other hand, have a rather transparent
meaning due to the presence of the Nsent, which retains its original sense.
However, a number of problems are posed and the limits between SVCs and verbal
fixed expressions (see also §5.1) are in some cases fuzzy: despite the semantic
transparency entailed by the Nsent, the overall structure is often susceptible
to a number of constraints, as shown in examples (43) and (44) below:


(43) Φωτίστηκε  το  πρόσωπο  του Νίκου από χαρά.
     fotistikie to  prosopo  tu  Niku  apo xara
     was.lit.up the face.NOM the Nikos by  happiness


     ‘Nikos' face lit up with happiness.’

(44) * Φωτίστηκε  ο   Νίκος     από χαρά.
       fotistikie o   Nikos     apo xara
       was.lit.up the Nikos.NOM by  happiness


According to a study on verbal MWEs (Balibar-Mrabti 1995), expressions like
the one depicted in (43) are defined as semi-fixed ones. In this respect, the
verbal MWEs under study were found to be placed along the continuum fixed,
semi-fixed and SVCs. Consequently, the class of semi-fixed expressions
constitutes a grey zone, the intermediate class mentioned in §5.1 and §5.2.2.
However, in this work, we opted for classifying semi-fixed expressions that
comprise a predicative noun Nsent as SVCs.

One step further, the correlation between the features non-compositionality/ 
fixedness and the attributes polarity and intensity was examined. Our underlying 
assumption was that the degree of fixedness of the relevant expressions and the 
polarity/intensity of the emotion denoted are highly correlated. In this respect, 
the focus was placed on the values assigned for the feature intensity of the emo- 
tion denoted and their correlation to the aspects of MWE category (i.e., fixed ex- 
pression or SVC). The majority of the considered Greek MWEs, that is 410 expres- 
sions, were attributed the value Negative for the feature Polarity, whereas only 
169 were encoded as Positive and 133 as Neutral. Of these, 97 MWEs denote anger, 
73 denote fear, and 105 denote sadness; 90 expressions were identified as express- 
ing joy and 30 a surprise event. The remaining expressions are distributed across 
the remaining conceptual categories. Another interesting remark concerns ver- 
bal idiomatic non-compositional expressions; most of the expressions (260) that 
have been assigned the value negative for the feature polarity are also encoded 
as being of type fixed (as opposed to 150 expressions classified as SVCs). Addi- 
tionally, fixed expressions were - in most cases - attributed a value high for the 
feature intensity. Of the approximately 300 fixed expressions, 210 are assigned 
the value high for the feature intensity. On the contrary, SVCs in both languages 
do not constitute a uniform class, and the overall emotion intensity denoted de- 
pends largely on the Vsup selected rather than the Nsent itself. Three cases are 
identified: 


• The Vsup is selected by all Nsent predicates; these verbs* adhere to a pro-
ductive and relatively open paradigmatic axis, and syntactic variability is
allowed to some extent. In these cases, the intensity of the emotion denoted
is determined on the basis of the semantics of the Nsent; any possible mod-
ifier functions as an intensifier of the emotion denoted.

* For example, έχω (exo)/avoir ‘to have’, νιώθω (nioθo)/éprouver ‘to feel’ and αισθάνομαι (esθanome)/ressentir ‘to feel’.


• The Vsup selection is subject to lexical restrictions, and syntactic variability
is not allowed.† In this case, the Vsup contributes to the intensity and/or
some aspectual meaning of the emotion denoted. The overall intensity of 
the emotion expression is determined on the basis of the semantics of the 
Nsent, and the Vsup functions as an intensifier. 


• The Vsup selection is extremely limited or unique, and a strong lexical-
ization is attested; syntactic variability is not allowed and the Vsup is an
intensive or aspectual variant that has a strong impact on the intensity of 
the emotion denoted: 


(45) με τρώει η ζήλια / *στενοχώρια / *λύπη
me troi i zilia / stenoyoria / lipi
me eats the jealousy.NOM / worry.NOM / regret.NOM
‘to be devoured by jealousy’


(46) être rongé par la jalousie
to.be gnawed by the jealousy 
‘to be devoured by jealousy’ 
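
The counts reported earlier in this section can be turned into proportions with a few lines of Python. The sketch below only restates figures given in the text; it assumes that the three polarity values are exhaustive (so that their sum is the size of the Greek data set) and it treats the "approximately 300" fixed expressions as exactly 300, which is an approximation. All variable names are illustrative.

```python
# Counts as reported in the text for the Greek data.
polarity = {"negative": 410, "positive": 169, "neutral": 133}
negative_by_type = {"fixed": 260, "SVC": 150}
fixed_total_approx = 300      # "approximately 300" in the text (assumption: exactly 300)
fixed_high_intensity = 210


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumption: every expression received exactly one of the three polarity values.
total = sum(polarity.values())
polarity_shares = {name: share(count, total) for name, count in polarity.items()}

# Share of negative-polarity expressions that are of type fixed.
fixed_among_negative = share(negative_by_type["fixed"], polarity["negative"])

# Share of fixed expressions that carry the value high for intensity.
high_among_fixed = share(fixed_high_intensity, fixed_total_approx)

print(total, polarity_shares, fixed_among_negative, high_among_fixed)
```

Under these assumptions, negative expressions make up somewhat more than half of the data, and roughly two thirds of them are of type fixed, which is in line with the tendency described in the text.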


7 Cross-lingual considerations 


Research on idioms reported in Villavicencio et al. (2004) shows that there is 
remarkable variation in MWEs across languages. Similar variations are attested 
in the data used in the current research. As one might expect, there is no one-to- 
one correspondence between syntactic patterns in the two languages. It is worth 
looking at SVCs and fixed expressions separately here. 

Greek and French SVCs present a number of similarities in terms of the un- 
derlying syntax and semantics. In some cases, even a direct lexico-syntactic cor- 
respondence is observed for a cross-lingual MWE pair with similar semantics as 
illustrated in (47) and (48) below. Furthermore, semantic transparency in SVCs 
implies more correspondences at least at the level of syntactical patterns - we 
have demonstrated this with examples (8) and (9). As one might expect, differ-
ences between the Greek and French expressions are basically limited to those
that exist in general between the two languages, i.e., usage of determiners and the
indefinite article, case marking for NPs in subject and object position in Greek
as opposed to PP complements in French, etc.

† For example, ανατριχιάζω (anatriyiazo)/frissonner ‘to shiver’, λάμπω (labo)/briller ‘to shine’,
λιώνω (liono)/fondre ‘to dissolve’, etc.


(47) δίνω στα νεύρα
dino sta nevra
give to.the nerves
‘to cause anger’ (literally, ‘to give to the nerves’)

(48) donner sur les nerfs
to.give on the nerves
‘to cause anger’


In other cases, Greek and French SVCs share the same syntactic structure and 
underlying semantics, yet their lexical composition is different. The differences 
are attested both in the lexical choice of the Vsup and/or the overall structure of 
the verbal expression. For example, the French verb nager ‘to swim’ seems to be 
more productive than its Greek counterpart πλέω (pleo) ‘to sail’ as shown in (49)
and (50) below. The latter is only employed in a rather fixed configuration and 
selects only one Nsent, showing, thus, a limited (or even fixed) distribution: 


(49) nager dans le bonheur / la joie / l'optimisme / l'amour
to.swim in the happiness / the joy / the optimism / the love
‘to be very happy / happy / very optimistic / in love’

(50) πλέω σε πελάγη ευτυχίας / *στην ευτυχία / *στην αισιοδοξία / *στην αγάπη
pleo se pelayi eftixias / stin eftixia / stin esioδoksia / stin ayapi
sail in seas happiness.GEN / in.the happiness / in.the optimism / in.the love
‘to be very happy / happy / optimistic / full of love’


Being conceptual metaphors (usually obsolete), fixed expressions present in 
some cases considerable similarities in both lexical choice and structure cross- 
linguistically. Again, differences are limited to the usage of determiners, argu- 
ment realization, selection of prepositions, etc. Often, the lexicalised nominal 
element (that assumes the function of the direct object) denotes a part of the 
body (Npc) as exemplified below. These expressions open a slot that is filled by
a variable noun in genitive case in Greek and a PP complement in French (à N
‘to N’). This element is usually realised as a cliticised pronoun, in both Greek
(51) and French (52), and it designates the beneficiary of the event expressed by
the predicate (Leclère 1976; Fotopoulou 1993a). This genitive (in Greek) or PP
(in French) is a specific case with its own semantic and syntactic features; Leclère (1976)
has proposed the term datif étendu for it:


(51) μου κόβονται τα ήπατα ‘my liver is cut’
του κόπηκαν τα ήπατα
tu kopikan ta ipata
he.GEN cut the liver.PL.NOM
‘to be frightened’

(52) lui casser les pieds
him to.break the feet
‘to get on one's nerves’


In some cases, similarities are attested in terms of argument structure. For 
example, the Greek verbal expression depicted in (53) and its French counter- 
part shown in (54) are encoded as entries in Greek and French tables. Each table 
features MWEs that share the same properties and lexico-syntactic constraints; 
this means that the resulting tables are to a large extent homogenous. Therefore, 
correspondences between homogenous LG tables in Greek and French can be 
obtained and mappings of MWEs from one language to the other are feasible. 


(53) βγαίνω από τα ρούχα μου
vyieno apo ta ruya mu
get.out from the clothes mine
‘to be very angry’

(54) sortir de ses gonds
to.get.out of one's hinges
‘to be very angry’


Additionally, there are many verbal idiomatic expressions which have no di-
rect or precise equivalent in the other language and instead correspond to a single-
word verbal predicate, as shown in the Greek example (55), which is rendered by
the French verb gâcher ‘to spoil’:
(55) του το βγάζω ξινό
tu to vyazo ksino
he.GEN it take.out sour
‘to make unpleasant’ (literally, ‘to me he/she/it takes it out sour’, i.e. ‘he/she/it makes it unpleasant to me’)


Expressions that are semantically almost equivalent but still differ in as-
pectual meaning and/or the intensity of the emotion have been identified in the
Greek and (to a large extent) in the French data. Sense discrimination and the
alignment of Greek and French MWEs can be enhanced on the basis of the val-
ues assigned to those emotion-related attributes: MWEs that are classified
under the same emotion concept can be told apart by the values they receive
for these attributes.


8 Conclusions and future research 


MWEs pose challenges with respect to their identification, analysis and repre-
sentation both to linguistic theory and to applications. In this study, we aimed at
consolidating the typologies of emotion MWEs in Greek and French and at find-
ing cross-lingual analogies and asymmetries. The syntactic, lexical and seman-
tic properties of the Greek and French verbal constructions were systematically
examined, also taking into account the properties specific to the semantic
field, namely the features intensity and polarity of the emotion denoted. We have
shown that, despite existing idiosyncrasies, in both languages the MWEs in the
semantic field of emotion share properties. Moreover, syntactic, semantic and
lexical features of emotion MWEs seem to have an impact on the semantics of
the expression in terms of emotion-related features. Future work will be oriented
towards (a) investigating the properties of semi-fixed expressions, taking into ac-
count the degree of fixedness; (b) studying the aspectual variants of SVCs in both
languages; (c) revising the coding used in the emotion lexicon according to new
studies and data; and (d) populating the lexical resource with new expressions.


Abbreviations

Asent  adjective denoting emotion    Nsent  noun denoting emotion
CAUS   cause                         Prv    pre-verbal position
EXP    experiencer                   REFL   reflexive pronoun
INCHO  inchoative                    SVC    support verb construction
E      zero element                  Vsent  verb denoting emotion
LG     Lexicon-Grammar               Vsup   support verb


This chapter is set in the context of Corpus Pattern Analysis (CPA), a technique
developed by Patrick Hanks to map meaning onto word patterns found in corpora. 
The main output of CPA is the Pattern Dictionary of English Verbs (PDEV), cur- 
rently describing patterns for over 1,600 verbs, many of which are acknowledged to 
be multiword expressions (MWEs) such as phrasal verbs or idioms. PDEV entries 
are manually produced by lexicographers, based on the analysis of a substantial 
sample of concordance lines from the corpus, so the construction of the resource 
is very time-consuming. The motivation for the work presented in this chapter is 
to speed up the discovery of these word patterns, using methods which can be 
transferred to other languages. This chapter explores the benefits of a detailed con-
trastive analysis of MWEs found in English and French corpora with a view to
English-French translation. The comparative analysis is conducted through a case 
study of the pair (bite, mordre), to illustrate both CPA and the application of sta- 
tistical measures for the automatic extraction of MWEs. The approach taken in 
this chapter takes its point of departure from the use of statistics developed ini- 
tially by Church & Hanks (1989). Here we look at statistical measures which have 
not yet been tested for their ability to discover new collocates, but are useful for 
characterizing verbal MWEs already found. In particular we propose measures to 
characterize the mean span, rigidity, diversity, and idiomaticity of a given MWE. 


1 Introduction: phraseology and Multi-Word Expressions


Traditionally, people have long believed that each word has one or more mean- 
ings and that these meanings can be selected and put together, as if in a child's 
Lego set, to construct propositions, questions, etc. This belief is still widely (and 
unquestioningly, unthinkingly) held by many NLP (Natural Language Process- 
ing) researchers among others. This may indeed be a good way of accounting for 
basic propositional logic, but it accounts at best for only a very limited subset of 
natural language use. An alternative view is that logics are by-products of natural 
language. At the very least, we may say that the relationship between language 
and logic is not well understood. If the "Lego set" theory of meaning in language 
were tenable, it would not have been necessary for NLP and AI (Artificial Intelli- 
gence) researchers such as Ide & Wilks (2007), after many years of intensive (and 
expensive) effort, to declare that projects in Word Sense Disambiguation (WSD) 
have failed to achieve even their most basic goals. 


At present, WSD work is at a crossroads: systems have hit a reported ceil- 
ing of 70%+ accuracy (Kilgarriff et al. 2004), the source and kinds of sense 
inventories that should be used in WSD work is an issue of continued de- 
bate, and the usefulness of stand-alone WSD systems for current NLP ap- 
plications is questionable. (Ide & Wilks 2007: 15). 


The alternative view mentioned here is supported by lexicographers such as 
Atkins et al. (2001), Kilgarriff et al. (2004), and Hanks (2000). These lexicogra- 
phers argue that much of the meaning of an utterance is carried by underlying 
patterns of co-selection of the words actually used, rather than by simple con- 
catenations. These conclusions overlap to some extent with the tenets of Con- 
struction Grammar, though the methodologies are very different. In corpus lin- 
guistics, Sinclair declared, after a lifetime's empirical research into texts, corpora, 
and meaning, "Many if not most meanings require the presence of more than one 
word for their normal realisation" (Sinclair 1998: 4). 

If these lexicographers and corpus linguists are right, it might appear that 
MWEs play a central role in the meaningful use of language. They are not merely
an irritating set of exceptions, as used to be thought. According to this, MWEs 
are not exceptions to the rule; they are the rule. The exceptions, insofar as they 
exist in normal language use, are isolated meaningful uses of single words. 

It has long been obvious that the meaning of MWEs such as of course, a ball-
park figure, and spill the beans is not compositional. No courses, ball parks, or
beans are invoked by someone deconstructing the meaning intended by a speaker
who uses these expressions. However, extended analysis of large volumes of data
leads to the somewhat unwelcome conclusion that the concept of a MWE may
also be flawed, being nothing more than an attempt to extend the "Lego-set"
theory to cover some so-called fixed expressions such as spill the beans and kick
the bucket. Here, the choice of lexical items is fixed: one cannot talk meaning-
fully, except perhaps in jest, about *tipping over the haricots or *booting the pail.
However, even in these very fixed MWEs, certain grammatical alternations, in 
particular verb inflections, are normal and unremarkable. 

More to the point is the fact that many other expressions, that at first sight 
might be considered compositional, are associated with a limited phraseology. 
They do not vary freely, but employ selectional variations drawn from within a 
(usually quite small) lexical set. Such patterns are found for many expressions 
that intuitions alone might encourage us to classify as fixed. Corpus evidence 
shows that people not only grasp at straws, they also clutch at straws and even 
seize on straws. Moon (1998) observes that shiver in one's shoes (meaning 'to be
afraid') may at first seem to be a fixed expression, but in fact corpus evidence
shows that every lexical item in the expression allows a modicum of variation: 
people quake in their boots, shake in their sandals, and she even found a mention 
of policemen quaking in their size fourteens. (English policemen are supposed 
proverbially to have big feet.) The meaning of the idiom is the same in all cases; 
the cognitive values of the lexical items are so similar as to be virtually identical; 
and yet the actual words used to realize the expression can be different. 

Conversely, when we examine the corpus evidence for an expression that 
might uncontroversially be classified as compositional, such as (1), 


(1) the wind was blowing from the north 


we find that the utterer of this unremarkable little sentence is in fact activating 
the meaning by drawing on a pattern containing a small but open-ended lexical 
set of items alternating with wind: gale, blizzard, hurricane, typhoon, breeze, air, 
not to mention adjectival subclassifications such as a hot dry wind, a cold wind, 
strong winds, the fenland winds, a unidirectional wind. To these can be added some 
much rarer lexical items such as tempest, trades, and zephyr. At the other end of 
the sentence forming the prototype or stereotype for this particular pattern, we
find a very much larger set of expressions functioning as adverbials of direction:
besides from the north, these include from the south, from the sea, over a cliff face,
up the street, through a spider's web, and so on.

These very conventional expressions are best classified as realizations of non- 
compositional patterns rather than as compositional concatenations for a variety 
of reasons. A prominent one is that the pattern so identified is contrastive: it is a
set of stereotypical phrases that contrast with other uses of the words. For exam- 
ple, this pattern (see example 2) contrasts with other patterns having different 
meanings formed with the same verb, such as to blow a whistle and to blow up a 
bridge. 

Another reason for seeking to identify patterns of verb use is that, once a pat- 
tern is established in the language or in the mind of a speaker, it can be exploited 
metaphorically and in other ways. Some typical exploitations of this pattern of 
the verb blow, found in the British National Corpus, are shown in examples (2)-(6). 


(2) Dennis Healey [a politician] wobbles about according to which way the
wind is blowing. 


(3) The winds of neo-liberalism are blowing a gale through Prague. 


(4) Faint liberal breezes had been blowing through the Vatican since the second 
Vatican Council. 


(5) ...the winds of change that have blown through the energy business. 


(6) The winds of fate blew for Jean Morris, winner of Middlesbrough Council's 
Captain Cook Birthday Balloon Race. 


Metaphorical exploitations bring in additional evidence that a pattern has be-
come established. In the previous examples, the meaning can only be understood
in relation to the pattern the wind blows (not, say, blowing up a bridge), but cannot
be confused with it, as there is no wind blowing literally.

The aim of this short introduction to MWEs was to set the study of MWEs in 
the broad context of phraseology, and stress the obstacles in the way of linguistic 
description. In order to understand and process meaning in text, it is necessary 
first to compile inventories of patterns of language use, which can be used as 
benchmarks against which actual utterances can be compared. The following 
section presents Corpus Pattern Analysis, a method for deriving patterns from 
corpora. 


2 The Corpus Pattern Analysis framework 


Corpus Pattern Analysis (CPA) is a research procedure designed to create empir- 
ically well-founded resources for NLP applications by combining interactively 
human data analysis and machine learning. It is based on the Theory of Norms 
and Exploitations (TNE, Hanks & Pustejovsky 2004; Hanks & Pustejovsky 2005; 
Hanks 2013). TNE in turn is a theory that owes much to the work of Pustejovsky
on the Generative Lexicon (Pustejovsky 1995), to Wilks's (1975) theory of pref-
erence semantics, to Sinclair's work on corpus analysis and collocations (Sin-
clair 1966; 1987; 1991; 2004), to the Cobuild project in lexical computing (Sinclair
1987), and to the Hector project (Atkins 1992; Hanks 1994). CPA is also influenced 
by frame semantics (Fillmore & Atkins 1992). It is complementary to FrameNet. 
Where FrameNet offers an in-depth analysis of semantic frames, CPA offers a 
systematic analysis of the patterns of meaning and use of each verb. Each CPA 
pattern can in principle be plugged into a FrameNet semantic frame. Some work 
in American linguistics (Jackendoff 2002) has complained about the excessive 
"syntactocentrism" of American linguistics in the 20th century. TNE offers a lex- 
icocentric approach, with opportunities for synthesis, which will go some way 
towards redressing the balance. 

CPA starts from the observation that whereas most words are very ambigu- 
ous, most patterns have one and only one sense. Each word is associated with a 
number of patterns based on valency, which is comparatively stable, and one or 
more sets of preferred collocations, which are highly variable (Hanks 2012). In 
CPA, patterns of word use are associated with statements of meaning, called im-
plicatures. Each pattern has a primary implicature (the meaning of the pattern),
and possibly a number of secondary implicatures (de Schryver 2010). To take a 
simple example, the word blow is multiply ambiguous. However, the expression 
blow your nose is unambiguous and contrasts with 60 or 70 other patterns of use 
of the same verb. 

In the Pattern Dictionary of English Verbs (PDEV; http://pdev.org.uk), the
main output of CPA, the sense of blow your nose is stored in the pattern
“[[Human]] blow {nose}”, while the sense of the wind blows is represented by the
pattern “[[Wind | Vapour | Dust]] blow [No object] [Adverbial of direction]”. Pat-
terns may combine various kinds of categories such as semantic types (Human,
Wind, Vapour, Dust), grammatical categories (Adverbial of direction) and lexical 
items (nose). Semantic types are taken from the corpus-driven CPA Semantic On- 
tology available at http://pdev.org.uk/#onto. These categories may fill slots in the 
pattern template based on the SPOCA model, an acronym standing for the main 
clause roles that may be filled by arguments of a verb in a proposition: a Subject, 
a Predicator, an Object, a Complement, and an Adverbial (Halliday 1994). Each 
argument can in turn be further characterized if the pattern requires it, by fill-
ing in information on the “subargumental cues” such as the nature of determiners,
modifiers, quantifiers, prepositional phrases, and adverbs or particles. 
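
The bracket notation used in these patterns lends itself to a simple machine-readable form. The following is a minimal sketch, not PDEV's own software: it assumes only the three slot notations quoted above (double square brackets for semantic types, curly braces for lexical items, single square brackets for grammatical categories) and treats any other token, such as the verb, as a plain word. The function name `parse_pattern` and the slot labels are hypothetical.

```python
import re

# One alternative per slot kind described in the text. Order matters:
# "[[...]]" must be tried before the single-bracket form.
TOKEN = re.compile(
    r"\[\[(?P<semantic_type>.+?)\]\]"   # e.g. [[Wind | Vapour | Dust]]
    r"|\{(?P<lexical_item>.+?)\}"       # e.g. {nose}
    r"|\[(?P<grammatical>.+?)\]"        # e.g. [Adverbial of direction]
    r"|(?P<word>\S+)"                   # anything else, e.g. the verb
)


def parse_pattern(pattern):
    """Split a pattern string into (slot kind, list of alternatives) pairs."""
    slots = []
    for match in TOKEN.finditer(pattern):
        kind = match.lastgroup  # name of the alternative that matched
        alternatives = [part.strip() for part in match.group(kind).split("|")]
        slots.append((kind, alternatives))
    return slots


for example in (
    "[[Human]] blow {nose}",
    "[[Wind | Vapour | Dust]] blow [No object] [Adverbial of direction]",
):
    print(parse_pattern(example))
```

For the second pattern, the first slot comes out as a semantic-type slot with the three alternatives Wind, Vapour and Dust, followed by the verb and two grammatical-category slots. A fuller treatment would also have to cover the subargumental cues, which this sketch ignores.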
Figure 1: Proportion of NYS and complete and ready verbs w.r.t. fre-
quency range in BNC50.


At the time of writing, PDEV covered 1,614 verbs for a total of 6,163 patterns, 
out of an estimated 5,500 total number of verbs in English (PDEV is therefore 
about 30% complete). PDEV is linked to a portion of the British National Corpus 
(BNC), BNC50, from which some of the statistics presented in this chapter are 
computed. BNC50 contains about 54 million tokens, and BNC, about 100 million. 
Figure 1 shows that the frequency distribution of complete verbs is very similar 
to that of NYS (Not Yet Started) verbs, e.g. that 40 to 45% of English verbs have a 
frequency lower than 50 in BNC50. For this reason, although PDEV is incomplete, 
it contains a representative sample of English verbs, large enough to warrant 
pilot studies. Results will need to be confirmed when PDEV is complete. 

In PDEV, most verbs have a low number of patterns: the average number of 
patterns per verb is 3.8, and the verb with the greatest number of patterns is 
break, with 83 patterns. More than a quarter of verbs have only one pattern and 
78% of verbs have five patterns or fewer. Verbs can also be contrasted in terms of
qualitative characteristics. In particular, some of them are used in idioms, others
as phrasal verbs, and others combine with other lexical items in set phrases, which
we propose to call lexically grounded patterns. Table 1 indicates the number of
entries and patterns for these MWE-related categories of verbs. 


Table 1: Number of verbs and of patterns for several MWE categories in PDEV.

MWE type                      # verbs   # patterns   % patterns
Lexically grounded patterns       458        1,126         18.3
Phrasal verb patterns             198          512          8.3
Idiom patterns                    200          453          7.3
MWE total                         548        1,649         26.7


A lexically grounded pattern is a pattern which takes a lexical item or lexical set 
as an argument, either in subject, object, complement or adverbial position. For 
instance “[[Human]] take {responsibility} for [[Anything]]” is an example where a
lexical item, responsibility, occupies the object position. In general the presence
of lexical items is a strong sign of fixedness, so a significant portion of lexically
grounded patterns overlap with idioms. All in all, there are 1,649 MWE patterns in
PDEV, which account for 26.7% of PDEV patterns (about 34% of verbs). As each
pattern is linked to a set of examples from the BNC, the whole MWE pattern set 
is connected to a total of 26,392 corpus examples (an estimated 84,836 over the 
whole BNC50, i.e. 1,545 per million). 
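
The percentages in Table 1 follow from the totals given above (6,163 patterns and 1,614 verbs). The short sketch below recomputes them from the raw counts; it uses nothing beyond the figures reported in the text, and the variable names are illustrative. Because the categories overlap, the three category rows are not expected to add up to the MWE total.

```python
TOTAL_PATTERNS = 6163   # patterns in PDEV at the time of writing
TOTAL_VERBS = 1614      # verbs covered by PDEV at the time of writing

# (number of verbs, number of patterns) per MWE category, copied from Table 1.
table1 = {
    "Lexically grounded patterns": (458, 1126),
    "Phrasal verb patterns": (198, 512),
    "Idiom patterns": (200, 453),
    "MWE total": (548, 1649),
}


def percent(part, whole):
    """Return part/whole as a percentage (unrounded)."""
    return 100 * part / whole


# Share of all PDEV patterns that falls into each category.
pattern_share = {
    name: percent(patterns, TOTAL_PATTERNS)
    for name, (_, patterns) in table1.items()
}

# Share of all PDEV verbs that have at least one MWE pattern.
verb_share_total = percent(table1["MWE total"][0], TOTAL_VERBS)

for name, value in pattern_share.items():
    print(f"{name}: {value:.2f}% of patterns")
print(f"MWE verbs: {verb_share_total:.2f}% of verbs")
```

The recomputed shares agree with the reported figures to within a tenth of a percentage point, and the share of verbs with at least one MWE pattern comes out at just under 34%.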

PDEV idioms show very diverse statistical properties internally. For instance
the estimated frequency in BNC50 ranges from 1 occurrence (e.g. blowing off steam)
to 1,071 occurrences (for as follows), with an average frequency of 23.5 examples
and a high standard deviation (67.2). 70% of idioms have 5 or more as-
sociated examples and 90% have fewer than 40 examples. The verb with the highest
number of idioms is throw, with 24 idioms. Of the verbs with idioms, 64% have
one idiom, 19% have two idioms, and 17% have three or
more idioms.


3 A CPA study for English-French translation 


The case study presented in this section focuses on bite, because it was found 
to encapsulate a large number of facts about English verbs, and particularly id-
iomatic structures. This verb is compared to the French mordre, which translates
to ‘bite’ in its primary literal meaning: ‘using teeth to cut’. We will observe how
these verbs are used in each language and identify their common features and di-
vergences by applying CPA to corpora. Bite was analysed using a sample of 500
lines from the BNC, and the same sample size was extracted from the frTenTen
corpus (11 billion words; Jakubicek et al. 2013) for mordre.

Bite and mordre share interesting similarities in terms of their syntactic and 
semantic properties. Both verbs are mostly direct transitive, see examples (7) and 
(8), and can sometimes be accompanied by a locative adverbial, to indicate the
[[Body Part]] bitten. Both verbs are also used in an intransitive pattern where 
the bitten entity, typically found in object position, is moved to a prepositional 
complement position, with into (dans in French) as preposition, see examples (9) 
and (10). 


(7) Those dogs bit the neighbours, the dustbin men, visiting aunts and each 
other. 


(8) Le propriétaire ou le détenteur d'un chien qui a mordu une personne ou un
autre animal a l'obligation de le déclarer au commissariat de son
arrondissement.
‘The owner or holder of a dog which has bitten a person or another
animal is under the obligation to declare it to the district police.’


(9) I'll wager that your salivary glands started pumping out liquid as you
imagined yourself biting into the lemon. 


(10) Je mords dans une pêche : un goût d'eau sucrée accompagné d'un sentiment
de vide.
‘I bite into a peach: a taste of sweet water together with a feeling of
emptiness.’


These syntactic patterns are frequently employed in different situations which 
sometimes share very little in common with the literal meaning of the verb. 
To contrast these uses, CPA entries make use of semantic types which charac- 
terise the semantic properties shared by the collocates found in a given syntactic 
slot. In the literal sense of transitive patterns, bite and mordre typically collo-
cate with [[Human]] (with the particular case of vampires) and [[Animal]] (e.g.
dogs) as subjects, and with [[Human]], [[Animal]], and [[Body Part]] as objects.
Other [[Physical Object]] nouns (e.g. pillows, coins, pencils) were found in English,
but not in French, although they could be found in a larger sample. Transitive
patterns of bite were also found to combine with [[Eventuality]] as subject and
[[Human]] or [[Institution]] as objects, as in example (11).


(11) Provincial had been bitten by its own success. 


In this case, the pattern means “[[Eventuality]] adversely affect [[Human]] or
[[Institution]]”. The construction bite + into was also found with a metaphorical
pattern expressed as “[[Event]] bites into [[Event]]”, sharing the same meaning
as the previous pattern (signaling an adverse effect). These metaphorical uses of 
bite seem to be English-specific: no such pattern was found in the French sample. 
This is because French typically uses ronger ‘gnaw’, as in example (12). 


(12) a. The recession is biting deeply into industry. 


b. La récession ronge durement l'industrie.


When English speakers use bite with direct objects such as nails or fingers
to mean ‘chewing at one's fingernails, biting the tips off’, French speakers use
ronger with ongles and doigts respectively. In this case, it is also considered as a
distinct pattern in French. Other patterns were found, such as “[[Physical Object
1]] bite in | into [[Physical Object 2]]”,¹ where the subject is neither [[Human]] nor
[[Animal]]. This pattern can only be translated into French with mordre to cover
uses where “[[Blade]] makes small cuts into [[Physical Object]]”. When the subject
is acid, signalling the corroding effect the acid has on metal, French uses ronger.
For other types of object nouns, such as ploughs, French would use the phrasal
expression se planter + dans.

Semantic types can also help to contrast existing patterns from uses which 
combine with specific animals, e.g. [[Snake]], which was found both in French 
and English, and which refer to a different situation, defined as “[[Snake]] stabs 
[[Human]] or [[Animal]] with fangs, typically injecting poison under the skin”.
However, when considering [[Insect]] (e.g. mosquitoes) in subject position, the
normal French verb is piquer (see example (13) below). 


(13) a. The mosquitos came up and bit me in the dark. 


b. Les moustiques sont venus et m'ont piqué dans le noir. 


However, bite does not collocate with nouns of other flying bugs such as wasps,
bees, or hornets,³ whereas these nouns can be used indifferently with piquer. This
language-specific feature can be explained by the extra-linguistic fact that insects
bite to feed, but bees, wasps, and hornets possess a specific device, positioned at
the bottom of their bodies, used to kill or in self-defence. This is the only pat-
tern where piquer can be used as a translation of bite. The pattern “[[Human]]
or [[Animal]] bite through [[Physical Object]]” also has a literal meaning, but
cannot be translated using mordre. The best translation equivalent appears to be
grignoter (literally ‘nibble’), since it keeps the notion of ‘using teeth’, and correctly
translates ‘insects biting through leaves’. However this verb does not translate
the fact that the bitten entity is filled with holes. The verb bite was only found
in a single intransitive use, “[[Process]] bites”, with the meaning “[have] a notice-
able effect, usually an adverse effect”, as in the recession bit deeper. This would be
translated into French with the expression se faire sentir (literally ‘to be felt’). The
verb mordre was also found in metaphorical patterns which could not be trans-
lated with bite, namely “[[Building]] mord [[Area]]”, as in (14), and “[[Vehicle]]
mord la route”, as in (15).

¹ English also uses patterns with the phrasal verb eat away for this meaning.
² Although mordu par les moustiques is acceptable.
³ English uses the verb sting.


(14) Certaines des constructions mordaient sur des terres privées.
‘Some of the buildings encroached on private lands.’

(15) Quand vient le temps d'effectuer un dépassement, le véhicule mord la route.
‘When the time comes to pass the car in front, the vehicle grips the road.’


In addition to these patterns, 6 idioms were found for mordre (see Table 2), and 
10 idioms for bite (see Table 3). 


Table 2: Idiom CPA patterns for the verb mordre.

No  Pattern / Implicature                                          Frequency  %
4   {[[Human]] | le poisson} mord ({à l'hameçon | à l'appât})      10         2
    [[Human]] takes the bait (= is lured to do something that has bad
    consequences)
7   [[Human]] mord {la vie à pleines dents}                        6          1.2
    [[Human]] enjoys life to the full [literally, ‘bites life with full teeth’]
9   [[Human]] se mord {les doigts}                                 21         4.2
    [[Human]] experiences a bitter time [literally, ‘bites his/her fingers’]
11  [[Human 1]] fait mordre {la poussière} {à [[Human 2]]}         6          1.2
    [[Human 1]] causes [[Human 2]] to bite the dust (= to die) or to lose a
    challenge [the latter sense only in French]
12  {le serpent} se mord {la queue}                                16         3.2
    [[Human]] is stuck in a [[State of affairs]] and cannot find a way out
    [literally, ‘the snake bites his own tail’]
16  [[Human]] ne mord pas [NO OBJ]                                 6          1.2
    [[Human]] does not bite (= is harmless)
Table 3: Idiom CPA patterns for the verb bite.

No  Pattern / Implicature                                          Frequency  %
13  Human 1 bites Human 2's head off                               5          1.22
    Human 1 speaks sharply and unkindly to Human 2
14  Human bites REFLDET lip                                        8          1.96
    Human grips his or her lip firmly with the teeth
15  Human bites off more than [[Human]] can chew                   4          0.98
    Human undertakes a task that is too difficult for him or her to
    accomplish successfully
16  Human bites the hand that feeds [[Human]]                      5          1.22
    Human attacks his or her benefactor
17  Human or Institution bites the bullet                          21         5.43
    Human or Institution decides to do something necessary but unpleasant
18  Human is bitten by the [MOD] bug                               7          1.7
    Human becomes very interested in [MOD]
19  Human bites the dust                                           2          0.49
    Human dies suddenly and violently
20  Entity or Process bites the dust                               8          1.96
    Entity or Process comes to a sudden and unwelcome end
21  Human bites REFLDET tongue                                     8          1.96
    Human makes a desperate effort not to say what is on his or her mind
22  Once bitten twice shy                                          3          0.73
    An unpleasant experience causes someone to be more cautious in future


These idioms have little in common (apart from the correspondence between
patterns 11 in French and 19 in English) and do not involve the notion of ‘teeth
cutting’. Correct translation from French to English (and vice versa) therefore
requires knowledge of the kind that is encoded in CPA patterns. Pattern 12, for
instance, le serpent se mord la queue, is used to refer to situations where serpents
‘snakes’ are not involved, a phenomenon generally referred to as non-compositionality.
In the next section we propose to measure this property, as well as other important
features such as rigidity, using statistical measures. These measures will then be
applied to idioms, which are the focus of §5.


4 Statistical measures for the characterisation of MWEs 


In this section, we will describe the use of statistical measures to automatically
characterise the flexibility of MWEs. We feel that this is an important research
topic, as it can contribute to describing in which respects MWEs are flexible and
help to speed up their extraction from corpora.


4.1 Word association measures and lexicography: PMI 


In psycholinguistics, word association means, for example, that subjects think of
a term such as nurse more quickly after the stimulus of a related term such as
doctor. Church & Hanks (1989) redefined word association in terms of objective
statistical measures designed to show whether a pair of words are found together
in text more frequently than one would expect by chance. PMI (Pointwise Mutual
Information) between word x and word y is given by the formula


(16) $I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$


where P(x,y) is the probability of the two words occurring in a common context
(such as a window of 5 consecutive words), while P(x) and P(y) are the probabilities
of finding words x and y respectively anywhere in the corpus. PMI is
positive if the two words tend to co-occur, 0 if they occur together as often as
one would expect by chance, and less than 0 if they are in complementary distribution
(Church & Hanks 1989). PMI was used by Church & Hanks to examine the
content word collocates of the verb shower, which were found to include abuse,
accolades, affection, applause, arrows and attention. Human examination of these
lists is needed to identify the seed members of categories with which the verb can
occur, such as [[Speech Act]] and [[Physical Object]], giving at least two senses
of the verb (Hanks 2012).
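As an illustration of formula (16), the following minimal Python sketch computes PMI from raw counts. The helper names (pmi, window_pmi), the toy sentence, and the decision to estimate P(x,y) by dividing the windowed pair count by the corpus size are assumptions made for this example, not details taken from Church & Hanks (1989).

```python
import math
from collections import Counter


def pmi(f_xy, f_x, f_y, n):
    """PMI in bits from raw counts: log2( P(x,y) / (P(x) * P(y)) ).

    Probabilities are estimated as count / n, so the ratio simplifies to
    (f_xy * n) / (f_x * f_y).
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def window_pmi(tokens, x, y, window=5):
    """PMI of x followed by y within `window` tokens (hypothetical helper)."""
    n = len(tokens)
    freq = Counter(tokens)
    # Count every occurrence of y in the `window` tokens to the right of each x.
    f_xy = sum(
        tokens[i + 1 : i + 1 + window].count(y)
        for i, tok in enumerate(tokens)
        if tok == x
    )
    if f_xy == 0:
        return float("-inf")  # the pair never co-occurs in this window
    return pmi(f_xy, freq[x], freq[y], n)


# Toy corpus, invented for the example.
tokens = "the dog bit the man and the dog bit the cat".split()
print(round(window_pmi(tokens, "dog", "bit", window=2), 2))  # 2.46
```

A pair that co-occurs exactly as often as chance predicts, for example pmi(1, 10, 10, 100), scores 0, in line with the interpretation given above.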


4.2 Span, rigidity, diversity and idiomaticity 


Smadja (1993) recommends that collocations should not only be measured by
their strength, such as by using the z-score, but also by their flexibility. We
propose to characterise the flexibility of a multiword expression using four statistical
measures, each focusing on a dimension of variation.

A MWE can be characterised by its MEAN SPAN, that is, the stretch
of text it is found to cover on average. This can be measured using the mean μ of
the relative distances between two words making up the MWE, and computed
as follows:


(17) $\mu(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{dist}(X_i, Y_i)$


where n is the number of examples and dist(X_i, Y_i) is the distance in words
between the two words in the i-th example.

A MWE can also be further characterised by its RIGIDITY. This can be measured
using the standard deviation σ of the relative distances between the two words:


(18) $\sigma(X, Y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\mathrm{dist}(X_i, Y_i) - \mu(X, Y)\right)^2}$
i=1 


For standard deviation, the minimum value when all the examples have identi- 
cal span is 0, and there is no theoretical upper limit. Higher values would indicate 
a flexible or semantic, rather than a rigid, lexical collocation. 

In a study of David Wyllie's English translation of Kafka's Metamorphosis, 
Oakes (2012) found that stuck fast and office assistant had mean inter-word dis- 
tances of 1 with a standard deviation of 0. This showed that in this particular 
text, they were completely fixed collocations where the first word was always 
immediately followed by the second. Conversely, collection and samples had a 
mean distance of 2.5 with a standard deviation of 0.25. This collocation was a 
little more flexible, occurring both as collection of samples and collection of textile 
samples. Mr. Samsa had a mean distance of 1.17 and a standard deviation of 0.32. 
This is because it usually appeared as Mr. Samsa with no intervening words, but 
sometimes as Mr. and Mrs. Samsa. 

Another way of looking at the flexibility of a collocation is by measuring the 
DIVERSITY of surface forms found for that collocation. A rigid collocation, where 
all found examples are identical in form and span, has very low diversity, while 
a collocation which has many surface forms has much higher diversity. One 
measure of diversity, popular in ecological studies, is Shannon's diversity index, 
which is equivalent to entropy in information theory, and given by the formula: 


(19) $E = -\sum_{i=1}^{N} P_i \log_2 P_i$


E is entropy, N is the number of different surface forms found for the collocation,
i refers to each surface form in turn, and P_i is the proportion of all examples
that have the surface form currently under consideration. The choice of
logarithms to the base 2 ensures that the units of diversity are bits. The minimum
value of diversity (when all the examples of a MWE are identical) is 0, while the
maximum value (when all the examples occur in different forms) is the logarithm
to the base 2 of the number of examples found.

Finding statistical evidence for the flexibility of a sequence of words does not
automatically entail that all the examples of the sequence belong to a MWE, and
that the reading is non-literal. We therefore propose to measure the IDIOMATICITY
of a MWE in context, by taking the ratio of the number of idiomatic occurrences
of the expression divided by its total number of occurrences:


(20) $\mathrm{Idiomaticity}(x, y) = \frac{\text{number of idiomatic occurrences}}{\text{total number of occurrences}}$


A value of 1 would indicate that the MWE is always idiomatic, while a lower
value would indicate that the MWE can be ambiguous with respect to its idiomatic
reading. It must be borne in mind that the value of this ratio depends on a number
of factors, such as the overall frequency of the verb of a MWE in the specific
language. The more frequent the constituents of the MWE are in everyday language,
the more likely they are to be encountered in a corpus in their literal
meaning. This is not related to the idiomaticity of the expression per se, which
has to do with the opacity of the expression: the more opaque (as opposed to
transparent) it is, the more idiomatic it is.


4.3 Worked out example 


To illustrate how the values of each measure are computed, we propose a worked-out
example based on a pair of words used as boundaries, bite and dog, in a sample
of 10 examples taken from pattern 1 of bite:


(21) “[[Human 1 | Animal 1]] bites [[Animal 2 | Physical Object | Human 2]]”


We chose this pair because it is a strong collocation (PMI = 7.7 in BNC). To 
apply our statistical measures, the first thing to do is to compute the distance 
between the boundary words. First, it is worth noting that we lump together 
alternative surface forms of the same boundary word, so we consider both dogs 
and dog as one word. Different decisions at this stage may lead to different results. 

Figure 2 provides an example using signed distance (left or right): in the first
example, dog is four words away to the left of bite, so the distance is -4. To
compute the mean span, however, we recommend using the unsigned distance
(i.e., 4 for the first example), but it is important to use the signed distance to
compute the standard deviation, in order to capture word order variation. The
unsigned text distances are therefore, in order of appearance of the examples,
4, 4, 3, 2, 4, 1, 3, 2, 1, 2.

The mean μ characterises the mean span of an expression: bite and dog are 2.6
words apart.

(22) $\mu_{\{bite, dog\}} = \frac{4+4+3+2+4+1+3+2+1+2}{10} = 2.6$


The standard deviation characterises the rigidity of an expression and makes
use of the mean of the signed distances μ′, computed as follows:


(23) $\mu'_{\{bite, dog\}} = \frac{(-4)+(-4)+3+(-2)+4+(-1)+3+(-2)+(-1)+(-2)}{10} = -0.6$


The standard deviation can therefore be computed as:


(24) $\sigma_{\{bite, dog\}} = \sqrt{\frac{(-4-(-0.6))^2 + (-4-(-0.6))^2 + \dots + (-2-(-0.6))^2}{10}} = \sqrt{7.64} = 2.76$


The score obtained for bite and dog is indicative of a low rigidity (2.76).
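The computation of the mean span and the rigidity for this sample can be sketched in a few lines of Python. The function names are hypothetical, and the list of signed distances is our reading of the concordance in Figure 2 (negative when dog precedes bite); the division by n rather than n - 1 is chosen because it reproduces the value of 2.76 reported above.

```python
import math

# Signed distances for the 10 examples, read off the concordance in Figure 2
# (negative values: "dog" occurs to the left of "bite").
signed = [-4, -4, 3, -2, 4, -1, 3, -2, -1, -2]


def mean_span(distances):
    """Mean of the unsigned distances between the two boundary words."""
    return sum(abs(d) for d in distances) / len(distances)


def rigidity(distances):
    """Population standard deviation of the signed distances."""
    mu = sum(distances) / len(distances)
    return math.sqrt(sum((d - mu) ** 2 for d in distances) / len(distances))


print(round(mean_span(signed), 2))            # 2.6
print(round(sum(signed) / len(signed), 2))    # -0.6
print(round(rigidity(signed), 2))             # 2.76
```

Using the signed values for the standard deviation means that an expression whose words swap order (as in passives) is scored as less rigid than one that merely admits a few intervening words.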

No  Left context                 Node    Right context                 Distance
1   ‘A dog that barks doesn't    bite    , replied Antonio Navarro,    -4
2   of the dogs that had been    bitten  and strayed: scared that th   -4
3   in saliva when one animal    bites   another). in dogs , one oft   +3
4   who had trained his dog to   bite    Arabs, and who informed       -2
5   /p><p> He was chased and     bitten  by a police dog and then a    +4
6   t was saved when her dog     bit     him. </p><p> The 22-year-     -1
7   heltenham yesterday after    biting  his pet dog , which was at    +3
8   time by their own dog are    bitten  in the bedroom. In our bree   -2
9   d. </p><p> After that dogs   bit     me on the feet. Blood came    -1
10  / herself that dogs always   bite    people, especially them. T    -2

Figure 2: Example of calculated distances for the pair (bite, dog) in
concordance for bite.

To compute diversity (entropy), we extract all the patterns of word forms between
boundaries and count the frequency of each pattern class. Characters could
also be used as the basic unit, but we use words; likewise, the string of the
pattern can be characterised in various ways, and we use word forms. A pattern is
a full string between boundaries, with the null class accounting for cases where
boundary words are adjacent. For X = {dog, dogs}, Y = {bite, bites, bit, bitten}, and i =
{that barks doesn't, that had been, another). in, to, by a police, { }, his pet, are, always},
P_i corresponds to the number of times the string is observed in the sample,
divided by the total number of examples (in our case, 10). The entropy is computed
as follows:


(25) $E_{\{bite, dog\}} = -\left(\frac{2}{10}\log_2\frac{2}{10} + \frac{1}{10}\log_2\frac{1}{10} + \dots + \frac{1}{10}\log_2\frac{1}{10}\right) = 3.12$


The entropy is quite high as there is no particular pattern that dominates the 
sample: only the null pattern occurs twice, but the others, only once. Finally, no 
expression formed with bite and dog was found to have an idiomatic reading, 
therefore the idiomaticity is equal to 0. 
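The diversity and idiomaticity values for this sample can be checked with the short Python sketch below. The function names are hypothetical, and the list of intervening strings is our reading of the ten concordance lines, with the empty string standing for the null pattern (adjacent boundary words).

```python
import math
from collections import Counter


def shannon_diversity(forms):
    """Shannon entropy (in bits) of the distribution of surface forms."""
    counts = Counter(forms)
    n = len(forms)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def idiomaticity(idiomatic, total):
    """Ratio of idiomatic occurrences to all occurrences."""
    return idiomatic / total


# Strings found between the boundary words in the 10 examples;
# "" is the null pattern, which occurs twice.
between = [
    "that barks doesn't", "that had been", "another). in", "to",
    "by a police", "", "his pet", "are", "", "always",
]

print(round(shannon_diversity(between), 2))  # 3.12
print(idiomaticity(0, 10))                   # 0.0
```

With ten examples the maximum possible diversity is log2(10), about 3.32 bits, so a value of 3.12 is close to the upper bound.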


The proposed measures are described for two variables. However, many idioms 
include more than two words, such as let the cat out of the bag. In such cases we 
take the span of the idiom as the distance from the first word to the last, which 
for this example would be 6 words. 


5 A contrastive statistical analysis of idioms 


In a pilot experiment on the annotated sample of the BNC corpus of bite, we 
found that the phrase bite the bullet was maximally rigid, as it occurred all 9 
times in exactly that form. Thus the standard deviation of the collocation span 
was 0, and its diversity was also 0. In contrast, the phrase bitten by the ...bug was 
extremely flexible, occurring all 6 times in different forms such as bitten by the 
travel bug, bitten by the London bug, and bitten by the bug of the ocean floor. The 
standard deviation of spans was relatively small (0.48), reflecting that in all cases 
but one the variation consisted of the insertion of a single word, but the diversity 
index was at its maximum value for a set of 6 examples, log2(6) = 2.585. 

The results for bite were borne out when the experiment was repeated on a 
larger corpus, the entire BNC. Table 4 shows the results obtained for English id- 
ioms. Idioms are represented by their boundary words and the table provides the 
scores using standard measures of collocational strength (PMI, t-score, and Log- 
Dice), along with the absolute frequency and our new measures: idiomaticity, 
entropy, mean span, and standard deviation. 

Table 4: Summary of scores for some idioms of bite. SD: Standard Deviation

Idiom                Freq     PMI     t-score  LogDice  Idiomaticity  Entropy  Mean span  SD
                     (total)
[back] [bite]        87       5.914   10.380   5.549    0.989         0.338    1.057      0.277
[bullet] [bite]      36       10.484  6.477    8.561    1             1.069    2.055      0.404
[head] [bite] [off]  30       6.009   7.639    5.600    0.775         3.381    3.032      2.721
[dust] [bite]        26       8.918   5.088    7.438    1             0.235    2.08       0.192
[bug] [bite]         19       10.589  4.688    7.894    0.842         3.326    3.125      2.578

In the full BNC, there were 19 occurrences of “[bite] by X the bug” altogether,
where “[bite]” stands for any grammatical variant of bite, such as bitten, and “X”
stands for any number (possibly zero) of intervening words. 16 of these were
idiomatic, including 3 variants of the farewell sleep tight, don't let the bed bugs
bite, and 3 were literal, as in I've been bitten by bugs in a hooker's bed. This gave an
idiomaticity of 16 / 19 = 0.842. Of the idiomatic examples, almost all were unique,
such as bitten by the travel bug; the other bugs included puppy love, acting, racing,
flower pressing and showbiz. On 3 of these occasions the nature of the bug did not
appear between bitten and bug, which were simply connected as bitten by the bug.
The Shannon diversity, resulting from pattern classes of 4, 3 and 3 members and 9
unique occurrences, had a very high value of 3.326. In terms of rigidity, the mean
distance between bite and bug was 3.125, with a high standard deviation of 2.578.
This was because cases such as the acting bug really bit me used the inchoative
alternation, so bug appeared before bit. Rigidity was also influenced by the fact
that, even in the active voice, the number of intervening words⁴ could vary.
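The diversity value reported for this idiom can be reproduced from the class sizes alone. The expansion below is a check under the assumption that the 19 occurrences fall into one class of 4, two classes of 3, and 9 classes with a single member:

```latex
\[
E_{\{bite,\,bug\}}
  = -\left( \tfrac{4}{19}\log_2\tfrac{4}{19}
          + 2\cdot\tfrac{3}{19}\log_2\tfrac{3}{19}
          + 9\cdot\tfrac{1}{19}\log_2\tfrac{1}{19} \right)
  \approx 0.473 + 0.841 + 2.012
  = 3.326
\]
```

The nine singleton classes contribute about 2.01 of the 3.33 bits, which is why an idiom with an open modifier slot scores so high on this measure.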

The MWE bite the bullet occurred in 36 sentences altogether. There were no
literal examples at all, but two occurrences were mentions, as the names of a
racehorse and a pop song. Of the other 34 examples, the vast majority (29) were exactly in
the form “[bite] the bullet”, the remainder being in the forms bit the ideological
bullet (3), reversed as in a harder bullet to bite (1), and a statement by President
Bush about an opponent: “I bite bullets, he bites nails”. The idiom was rather rigid,
with a mean span of 2.055, and a fairly low standard deviation of 0.404. Diversity
was also low at 1.069.

The results on French idioms were obtained from the tagged sample from the
frTenTen corpus. The results obtained here have taken only a part of the corpus
into account. In the future, we will perform an exhaustive analysis of the remaining
196,500 examples. The scores are given in Table 5, using the same headers as
Table 4.

⁴ Its French adjectival counterpart, mordu de, is also diverse: mordue des nuitées en famille sous
la tente ‘fanatical about nights camping with the family’, mordus des jeux en ligne ‘addicted to
on-line games’ and mordue d'espéranto ‘bitten by the Esperanto bug’.

Table 5: Summary of scores for some idioms of mordre (500 lines sample).
SD: Standard Deviation

Idiom                  Freq     PMI     t-score  LogDice  Idiomaticity  Entropy  Mean span  SD
                       (total)
[poisson] [mordre]     4        7.340   1.988    -2.222   0.75          0        1          0
[hameçon] [mordre]     4        12.259  2        2.664    0.75          0        3          0
[appât] [mordre]       2        10.475  1.413    0.894    1             0        3          0
[vie] [mordre]         6        4.434   2.523    -5.126   1             1.792    2.833      0.372
[doigt] [mordre]       21       9.387   4.789    -0.174   0.952         1.08     1.190      0.154
[poussière] [mordre]   6        9.360   2.446    -0.204   1             0        1          0
[queue] [mordre]       16       9.670   4.118    0.109    1             0.34     1.187      0.527
[serpent] [mordre]     13       11.022  3.604    1.457    1             1.7      1.846      0.591

The idiom le poisson mord à l'hameçon is a popular expression in French which
means ‘to take the bait’ (see Table 2). As illustrated in Table 5, it was found in
3 different forms, which, despite varying mean span and frequency, were each
found to be maximally fixed (standard deviation = 0) and minimally diverse
(entropy = 0).

The pattern “[[Human]] se mord {les doigts}” rarely took its literal meaning
in French, standing for ‘a person experiencing a bitter time for his past actions’
in 20 cases out of 21. It usually occurred in the corpus as mordre les doigts, but
sometimes as se mord encore les doigts ‘bites his fingers again’, mordrait un peu
souvent les doigts ‘bit his fingers a bit often’ and other variants. This gave a mean
of 1.19, a standard deviation of 0.15, and an entropy of 1.08.

The idiom mordre dans la vie à pleines dents was also found as mordre la vie
à pleines dents. Table 5 lists the scores when both variants are combined. If we
consider vie as the boundary word (à pleines dents was only found once in a
mention of a song), mordre dans la vie occurred 4 times with mean span of 3,
while mordre la vie was found twice with mean span of 2.5. Since 5 out of 6
examples had a distance of 3 words, the standard deviation was quite low (0.372);
however the idiom had a high entropy (1.792 in Table 5), as mordre la vie contributed 2
different unique pattern classes to the idiom.

If we compare English and French, the corresponding phrases mordre la poussière
and bite the dust both have standard deviations for their spans of 0, since
in the BNC and frTenTen corpora the verb is always exactly 2 words before the
noun. However, as can be seen in Table 2, mordre la poussière can have the additional
use ‘losing a challenge’ which was not found in English. The MWEs bite
one's fingers and its apparent French translation se mordre les doigts are in stark
idiomaticity contrast. While bite one's fingers was always found to be literal (5
cases), almost all instances of se mordre les doigts (20 out of 21) were found to be
idiomatic. It is worth noting that translation systems unaware of these facts will tend to make
two mistakes (as can be checked with Google Translate): when translating from
French to English, they will fail to translate the figurative meaning of se mordre
les doigts with an equivalent idiom like kick oneself. From English to French, they
will fail to translate the literal meaning of bite one's fingers and translate it with
the frequent idiomatic sequence se mordre les doigts.

For the verbs mordre and
bite, we have shown that the measures of mean and standard deviation of span,
Shannon Diversity, and idiomaticity give reasonable results as they reflect the
flexibility of a MWE. We could also suggest a measure of constructional flexibility,
which might be the ratio of times a MWE occurs in the active voice divided
by the number of times the MWE occurs altogether, whether in the active or
passive voice.


6 Generalization of statistical measures 


Evaluating the applicability of statistical measures to different languages is one 
way to evaluate their validity. This section describes other methods to test the 
generalizability of measures. 


6.1 Comparison with cognitively salient idioms 


Hanks (2013: 5, 21, 214) makes a distinction between expressions that are cognitively
salient (roughly equivalent to “easily called to mind”) and socially salient
(roughly equivalent to “frequently used”). He suggests that cognitive salience
and social salience are independent variables, or may even be in an inverse
relationship: that is, frequently used expressions are buried deep in the language
user's subconscious mind and are not necessarily easily called to mind. The
idioms kick the bucket and spill the beans are probably the most cognitively salient
idioms, and those most frequently cited by linguists. Other idioms cited in this chapter
are grasping at straws, the way the wind blows, or shivering in one's boots. These
idioms, along with 5 idioms involving bite, make up the set of 10 idioms used for
the experiments described in this section (see Table 6).

In the BNC, kick the bucket has 21 occurrences, although another 4 sentences
containing both words were discounted as kick and bucket appeared in separate
clauses. Another 8 were from a linguistic discussion of the phrase, as in “notice
‘kick the bucket’ appears as a verb phrase”. Only 5 were idiomatic, in the sense of
‘to die’: 4 of these were in the exact form kicked the bucket, while the other had a
sequence of 9 words between kicked and bucket, in Arthur kicked the detonator of
the bomb, and consequently the bucket. This gave a mean separation of 3.5 words,
a standard deviation of 1.870, and a modest diversity of 0.721. However, these
results were biased by a small sample size and a single creative use of language.
This left 8 literal examples of the phrase, as in leaving his bucket to be kicked over
by the cow. Thus idiomaticity was 5/21 = 0.238.

In contrast, the phrase spill the beans, found 42 times overall in the BNC, was
almost always (40 times) found in the idiomatic sense of ‘reveal a secret’. The only
exceptions were when the phrase was used as the title of a book, A style guide
to the New Age called spilling the beans, and a television programme Superchefs
spill the beans, where the phrase spill the beans takes both the literal and the
figurative sense at the same time. The phrase was used just once in its purely
literal sense, where a guest house owner was dreading a dozen or more children
spilling their beans, wetting the beds, hoarding old crusts. Thus idiomaticity was
very high: 40/42 = 0.952. Of the 40 idiomatic cases, the vast majority were in the
exact form “[spill] the beans” (37); 2 were in the passive voice (when the beans
are spilled and the beans have been spilled), and just one replaced the with a few:
he spilt a few beans. The mean separation was 2.025, the rigidity as measured
by the standard deviation was 0.987, and diversity as measured by entropy was a
lowish value of 0.370. According to these results, spill the beans is more idiomatic,
less flexible and slightly less diverse than kick the bucket. These findings are in
stark contrast with reports that MWEs like spill the beans are more flexible than
the relatively well-behaved kick the bucket. Although kick the bucket is more
idiomatic in the sense that it is fully opaque, it occurs more often in text
in its literal meaning because its literal meaning is more frequent in everyday
language.

An idiom which stands out in Table 6 is way the wind blows, which was by far
the strongest collocation according to the t-score and LogDice measures, and had
one of the lowest idiomaticity scores (that is, a large proportion of literally intended
examples). bite ... bug had the highest entropy, as one can metaphorically be bitten
by many kinds of bug. Finally, bite ... hand ... [BENEFIT] had the greatest mean span
and standard deviation of span.

Table 6: 10 English idioms retained for generalization experiments.
SD: Standard Deviation

Idiom                             Freq     PMI     t-score  LogDice  Idiomaticity  Entropy  Mean span  SD
                                  (total)
[back] [bite]                     87       5.914   10.380   5.549    0.989         0.338    1.057      0.277
[bullet] [bite]                   36       10.484  6.477    8.561    1             1.069    2.055      0.404
[head] [bite] [off]               30       6.009   7.639    5.600    0.775         3.281    3.032      2.721
[bug] [bite]                      19       10.589  4.688    7.894    0.842         3.326    3.125      2.578
[hand] [bite] [BENEFIT]           15       5.584   7.639    5.196    1             2.463    5.933      5.842
[bean] [spill]                    40       10.947  6.705    8.917    0.952         0.370    2.025      0.987
[straw] [grasp/clutch]            33       9.865   6.077    8.172    0.892         2.213    3.485      1.623
[way] [wind] [blows]              21       10.663  25.264   10.652   0.676         2.488    3.5        0.534
[shoe/boot] [quake/shiver/shake]  12       5.043   5.056    5.608    1             2.057    3.417      1.382
[bucket] [kick]                   5        8.647   4.349    7.004    0.238         0.721    3.500      1.870


6.2 Inter-annotator agreement


Another way of demonstrating the validity of a statistical measure, such as MWE
idiomaticity or mean span, is to determine the Inter-Annotator Agreement (or
Inter-Rater Reliability, IRR). This is the degree to which two or more observers
might concur on a classification or annotation task. A measure is only valid to
the extent that humans can agree on the classification of the individual instances
which contribute to the measure. For example, do they agree on whether a MWE
is being used in its idiomatic sense or not, and where it starts and ends? IRR
falls in the range 0 for only random agreement to 1 for perfect agreement. As
an illustration, we estimated the IRR, using Krippendorff's α measure,⁵ between
two native speakers of English as regards the span and idiomaticity of the phrase
kick the bucket. There were 26 sentences in the British National Corpus containing
both kick and bucket. An α value of 1 denotes perfect agreement among the
annotators, and 0 shows that agreement occurred only by chance. The instructions
given to each annotator were as follows:


For each sentence, choose one of the following:


1) the phrase kick the bucket (or a grammatical variant of it) does not appear
in the sentence;


2) the phrase kick the bucket (or a grammatical variant of it) is idiomatic,
and means ‘to die’;


3) the phrase kick the bucket (or a grammatical variant of it) is literal, and
actually means to physically kick a bucket.


If you answered 2) or 3), use [ to show where the phrase kick the bucket
begins and ] to show where it ends, as in the example: “I'm too young to
[kick the bucket]”.


⁵ Krippendorff's α may be calculated using the ‘irr’ package in the R statistical programming
language. The package ‘irr’ can be installed by the following command: install.packages("irr",
repos = "http://cran.r-project.org")

The annotators' responses should be stored in a matrix, where each row holds one
annotator's response values. The R command to create a matrix for three examples and 2
annotators is, for example: m = matrix(c(1,1,3,3,1,2), nrow=2).


In our experiment to find the agreement of the native speakers as to whether
the phrase kick the bucket was absent, literal or idiomatic, Krippendorff's α was
0.745.⁶ A value between 0.6 and 0.8 is said to be “good” agreement (Altmann 1991:
404).

This experiment was then modified to consider only those cases where the annotators
considered the phrase kick the bucket to be present.⁷ Here we were looking at the
agreement between the annotators in distinguishing literal and idiomatic uses,
and Krippendorff's α was 0.635, still “good”.

To look at the agreement with respect to the span of the idiom, the values in
the matrix were replaced with the number of words between the square brackets
marked by the annotators, or NA if they did not find the idiom in the sentence.⁸
The annotators agreed in every case where they both marked off the start and
end of the idiom (α = 1), showing that the limits of the idiom kick the bucket
were clear-cut to these native speakers. Thus according to this small experiment,
the measures of idiomaticity and mean span are valid for the expression kick the
bucket.
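For readers who do not use R, the nominal version of Krippendorff's α can be sketched in plain Python via the coincidence matrix. This is an illustrative re-implementation written for this example, not the code of the ‘irr’ package; the function name and the input layout (one tuple of labels per sentence, None for a missing annotation) are assumptions. The demonstration data are the three-example, two-annotator matrix given in the footnote.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` is a list with one sequence of labels per annotated item;
    None marks a missing annotation. Assumes at least two categories occur.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(range(m), 2):
            coincidences[(values[a], values[b])] += 1 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    return 1 - observed / expected


# Three sentences, two annotators: (1, 1), (3, 3), (1, 2).
units = [(1, 1), (3, 3), (1, 2)]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.545
```

Sentences for which an annotator gave no usable label can be passed as None, which mirrors the replacement of values by NA in the modified experiments above.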


6.3 Correlation and relatedness of measures 


While the previous section illustrated techniques to test the validity of statistical
measures, this section describes a final experiment focusing on the relatedness
of different measures. To do this, we compared the values of our set of 10 idioms
(see Table 6) according to eight measures. These included the measures of mean and
standard deviation of idiom span, Shannon Diversity and idiomaticity, compared
with four standard measures for collocation strength: frequency of collocation,
PMI, t-score and LogDice. Both the t-score and LogDice are used by the Sketch
Engine lexicographers' tool (Rychlý 2008).


⁶ Krippendorff's α is found by the following command: irr::kripp.alpha(m, "nominal").

⁷ The matrix was modified so that all the 1s (denoting absence of the phrase) were replaced by
“NA” (not applicable).

⁸ This type of numeric data is called “ratio” data, so the appropriate command to calculate
Krippendorff's α is: irr::kripp.alpha(m, "ratio").
To determine whether these measures were independent of each other or
whether one acts as a predictor for another, Spearman's rank correlation coefficient
was computed for each pair of measures. This statistic was preferred to
the Pearson correlation coefficient, as the sets of values for some of the individual
measures were not normally distributed. The correlations between the measures
are shown in Table 7. The most statistically significant correlation (p = 0.002, cor
= 0.88) was between PMI and LogDice, suggesting that these measures of collocational
strength agree with each other well. Another significant correlation was
the inverse correlation between frequency and mean span (p = 0.008, cor = -0.78).
Thus there was a tendency for more frequent idioms to be shorter and (to a lesser
extent, not statistically significant) more rigid in their structures. There was no
significant correlation between any of the measures in Table 4 and Table 5 and
either of the measures of collocational strength, frequency and PMI.


Table 7: Correlations between scores in the 10 idiom study. 
SD: Standard Deviation 

Measure             Freq     PMI     t-score  LogDice  Idiomaticity  Entropy  Mean span  SD 
                    (total) 
Freq                 1 
PMI                  0.36     1 
t-score              0.51     0.03    1 
LogDice              0.24     0.88   -0.04     1 
Idiomaticity         0.23    -0.42   -0.09    -0.30     1 
Entropy             -0.39     0.08    0.01     0.01    -0.29          1 
Mean span           -0.78    -0.19   -0.15    -0.07    -0.25          0.43     1 
Standard deviation  -0.61    -0.27   -0.30    -0.44    -0.20          0.62     0.55       1 


These results suggest that the new measures of idiomaticity, entropy, mean 
span and standard deviation of span may not be useful for discovering new MWEs 
but, as we have shown, are useful for describing the characteristics of MWEs once 
discovered. 


7 Conclusions and perspectives 


Sinclair (2004) wrote that the so-called "fixed phrases" are not in fact fixed: most 
phrases in English display some variety of form. "Variation gives the phrase its 
essential flexibility, so that it can fit into its surrounding context". Conversely, 
each word cannot be considered as a simple "Lego brick" which can be fitted 
in a slot-and-filler system, as corpus-based investigations reveal that each word 
preferentially selects other words, echoing J.R. Firth's maxim that "You shall 
know a word by the company it keeps". In this context we have proposed to use 
Corpus Pattern Analysis (CPA) as a technique to describe word patterns found in 
corpora, and have applied this technique to two verbs in French and English. CPA 
is a corpus-based technique to detect the lexical, syntactic, and semantic 
preferences of verbs, such as the fact that bite preferentially selects mosquitoes and bugs 
while sting normally selects bees, wasps and hornets. The application of the CPA 
methodology to a French corpus revealed, however, that mordre, the French 
translation of bite in examples such as dogs bite, was used neither with mosquitoes nor 
with bees: French speakers prefer to use piquer ‘sting’ for most kinds of "flying 
entity aggression". This suggested that patterns of words are more reliable units 
of translation than words in isolation, which opens up new research perspectives 
for using CPA in Translation Studies and Machine Translation. 

In this chapter, we proposed statistical measures which could be applied to 
any MWE in any language, illustrating them on French and English. These 
new statistical measures characterise the flexibility of an MWE based on text 
distance: the mean span of MWEs, the standard deviation of the distance between 
their boundary words, their internal diversity, and their idiomaticity ratio. The 
results obtained by the application of these measures to bite and mordre revealed 
that each captured useful features of MWEs which compared favourably with 
intuitive notions of flexibility and compositionality. It is worth noting that the 
implementation of these measures required us to make a number of decisions 
explicitly, particularly deciding on a basic unit such as the word or character. 
Future perspectives include testing these measures on other languages, particularly those 
with so-called free word order, and application to Machine Translation. 
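
As a rough illustration of the distance-based measures summarised above, the sketch below computes a mean span, its standard deviation and a Shannon entropy for a handful of invented occurrences. It assumes that span is counted in word tokens from the first to the last boundary word inclusive, and that internal diversity is the entropy of the attested surface variants. Both are assumptions made for this example and may differ from the definitions used in the study; the function names are hypothetical and the occurrences are not corpus data.

```python
from collections import Counter
from math import log2
from statistics import mean, stdev


def span(tokens, first, last):
    """Tokens from the first boundary word to the last one, inclusive (assumed unit: word)."""
    start = tokens.index(first)
    end = tokens.index(last, start)
    return end - start + 1


def shannon_entropy(items):
    """Shannon entropy (in bits) of the distribution of distinct items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Invented occurrences of one expression (illustrative only).
occurrences = [
    "bite the bullet",
    "bite the bullet",
    "bite the proverbial bullet",
    "bite that particular bullet",
]

spans = [span(o.split(), "bite", "bullet") for o in occurrences]
mean_span = mean(spans)                   # average distance between boundary words
sd_span = stdev(spans)                    # sample standard deviation of that distance
diversity = shannon_entropy(occurrences)  # entropy of the surface variants

print(spans, mean_span, round(sd_span, 3), diversity)
```

Choosing the character rather than the word as the basic unit, one of the explicit decisions mentioned above, would only change how `span` counts; the remaining computations would stay the same.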

In his analysis of extended units of meaning, Sinclair (1991) noted, as we have 
done in our discussion of bite and mordre, that idioms can carry across to other 
languages. In his example, the Italian equivalent of naked eye is a occhio nudo. 
While this is true of many expressions, the contrastive analysis proposed in this 
chapter also suggests that the semantic space occupied by a single lexical item 
can be covered by several lexical items in another language. The MWE naked 
eye also exhibits a phenomenon we have not examined in this chapter: there is 
greater consistency of patterning to the left of the collocation than to the right. 
This suggests that we could use our measures to find the rigidity or diversity not 
only of the MWE itself, but of its context on either side. We could also look for 
the semantic prosody associated with MWEs: for example, things seen with the 
naked eye tend to be difficult to see (small, "weak" or "faint"). 


Abbreviations 

AI    Artificial Intelligence      PDEV  Pattern Dictionary of English Verbs 
CPA   Corpus Pattern Analysis      PMI   Point-wise Mutual Information 
IRR   inter-rater reliability      WSD   Word Sense Disambiguation 


Multiword expressions and the Law of Exceptions 


Koenraad Kuiper 
University of Canterbury, New Zealand 


This chapter proposes the existence of a linguistic universal, the Law of Exceptions. 
It hypothesizes that a relationship exists between the grammar of a language and 
its lexicon such that all regularities expressed in the grammar of a language are 
matched by exceptions which are manifested in the lexicon of that language. It is 
also proposed that lexical idiosyncrasies are of two types. Type 1 idiosyncrasies are 
in the nature of arbitrary restrictions on options provided in the grammar while 
Type 2 idiosyncrasies involve breaches of the rules of the grammar. To test this 
law requires an initial examination of the linguistic domains where it might be 
tested. As a preliminary step to testing these ideas, this chapter is a scoping exercise 
looking chiefly at the structural properties of a subset of multiword expressions 
(MWE). It shows, following Barkema (1996), that many properties of MWEs cross- 
classify. The aim of the overview is then to examine domains of the morphosyntax 
of any language which might be analysed for sources of structural idiosyncrasy and 
thus to determine how individual languages might vary in this respect. Languages 
of exemplification are English, which has a relatively fixed word order and slight 
inflectional system, and, to a lesser extent, Dutch and Maori, an Oceanic language. 


1 Introduction 


In the traditional grammar-lexicon model of human linguistic knowledge, the 
grammar accounts for the regularities in the language which a native speaker is 
taken to have acquired and thus the predictabilities in its sentences. The lexicon 
has traditionally been ‘an appendix of the grammar, a list of basic irregularities’ 
(Bloomfield 1933: 274). While this distinction is increasingly contested, it will 
be maintained here. It is, in any case, an open question as to just where the 
boundary between the grammar and the lexicon lies and, more significantly for 
what follows, what kinds of ‘basic irregularities’ are possible; that is to say what 
kinds of idiosyncrasies can be expected to occur in the lexical items of a language. 

Bloomfield's characterization raises the question as to the kinds of basic irreg- 
ularities which might be found in the lexicon of a language. They appear to be of 
two types. Some irregularities are exceptional in cases where the grammar pro- 
vides options but only one is taken in a particular lexical item. That does consti- 
tute an idiosyncrasy by way of arbitrary restriction but the rules of the grammar 
are not breached. Obligatory truncation is such a case as in (5) and (6). Phonetic 
truncation is an option the grammar provides but in (5) and (6) the MWEs are 
always truncated. Restricted collocations do not break any rules of the grammar. 
They provide arbitrary paradigmatic restrictions on linguistic choices when the 
grammar allows for a larger set of choices to be made. In such cases the gram- 
mar is permissive but the lexicon is restrictive. A second class of irregularities 
is the result of breaches of the grammar where the constraints imposed by the 
grammar do not provide alternatives. Here the grammar is restrictive but the 
lexicon is more permissive. Some borrowed words, for example, may breach the 
phonological constraints of a language. English phonotactics do not allow the 
onset sequence /ʃn/. But the dog breed name schnauzer is lexicalized in English with 
this initial cluster. Let us call exceptions of the first kind, where they are manifest, 
Type 1 idiosyncrasies and exceptions of the second kind Type 2 idiosyncrasies. 
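
The two-way distinction can be restated as a simple decision rule. The sketch below is only an illustrative encoding of it, not part of the proposal itself: the class and field names are hypothetical, and the single yes/no question (is the lexicalized form itself licensed by the grammar?) is an assumed simplification of the prose definition. The two sample items are the ones discussed in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class Idiosyncrasy(Enum):
    TYPE_1 = "arbitrary restriction on options the grammar provides"
    TYPE_2 = "breach of a rule of the grammar"


@dataclass(frozen=True)
class Observation:
    item: str
    # Assumed simplification: is the lexicalized form itself licensed by the grammar?
    grammar_licenses_form: bool


def classify(obs):
    """Grammar permissive, lexicon restrictive: Type 1. Grammar restrictive, lexicon permissive: Type 2."""
    return Idiosyncrasy.TYPE_1 if obs.grammar_licenses_form else Idiosyncrasy.TYPE_2


truncation = Observation("She'll be right (obligatory truncation)", True)
schnauzer = Observation("schnauzer (onset cluster barred by English phonotactics)", False)

print(classify(truncation).name, classify(schnauzer).name)
```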

I now propose a hypothesis to link the grammar to its exceptions. The following 
hypothesis, the Law of Exceptions, is proposed as the strongest compatible 
with the distinction between the grammar and the lexicon. 


Law of Exceptions: All formal properties of the grammar of a language 
are subject to exceptions manifested in idiosyncrasies in the lexical items 
of that language. 


The Law of Exceptions thus predicts that the lexicon of a language will contain, 
for every rule in the grammar of that language, lexical items which break it. This is 
in line with Di Sciullo & Williams's (1987) metaphor that 
the lexicon is like a prison in that its inmates have all broken one or other law 
(although Di Sciullo and Williams do not suggest that every law is broken by at least 
someone). 

Note that it cannot be assumed that the Law of Exceptions is prima facie true. 
It might well be that there are areas of the grammar of a language, and perhaps of all 
languages, where there are no exceptions, i.e. that there are laws that are never 
broken. The verb second constraint in Dutch and German may be such an instance: 
tensed verb second placement in main clauses of many Germanic languages is 
absolute, as is suggested in the later discussion around examples (48)-(52). 
The Law of Exceptions is therefore testable against all lexical items in the lexicon 
of a language. 

This chapter focuses on the structural properties of a subset of lexical items, 
MWEs that have syntactic structure. Such lexical items may vary in many ways, 
so an account of the ways in which these properties can vary in general is use- 
ful, not least as a checklist for languages whose phraseology has not been docu- 
mented. In the case of the languages of exemplification it will be shown that in 
every syntactic domain covered in this chapter, where there are regularities of 
both Type 1 and Type 2 in the grammar, there are exceptions in the lexicon. 

Since the Law of Exceptions provides a relationship between the grammar of 
a language and its lexicon it is important initially to determine where exceptions 
cannot in principle be found. The prediction is that this will only be the case 
where a grammar has no regularities. For example, in the domain of morphology 
it is often considered that Chinese languages have no derivational morphology.¹ 
If that is the case, then the idiosyncratic properties manifested in the derivational 
morphology of derived words in other languages which do have derivational 
morphology are not in evidence in Chinese languages and so, obviously, are not 
available for analysis as to their idiosyncrasies. In the domain of syntax, since 
the syntax of a language determines what kinds of syntactic idiosyncrasies are 
possible, if the syntax of a language has antipassive voice, then there may be 
MWEs which exist only in the antipassive form or not in the antipassive form 
even when it is plausible that they should.² But, since only ergative languages 
have an antipassive voice, this syntactic property places a limit on the kinds of 
idiosyncrasies which can be expected in the lexical entries of lexical items in a 
particular language. 

Since the Law of Exceptions applies to the current synchronic grammar, all 
MWEs which were historically unexceptional but are exceptions to the synchronic 
grammar fall under the Law of Exceptions. The reason for this is that native 
speakers of a language may be presumed to have internalized only the synchronic 
grammar of their native language (historical linguists excepted). 


¹But see Starosta et al. (1997) for a contrary view. 

²It is likely that, at least for some grammatical rules, there is more than one way in which they 
may be violated. Take, for example, the English passive. It may be that an MWE is only possible 
with a get auxiliary and not a be auxiliary or that the agent which is in an oblique position 
in a passive MWE cannot be deleted. The Law of Exceptions, therefore, needs to note in how 
many ways a grammatical rule might be breached. It is an open question whether all possible 
breaches have associated MWEs. 

Turning now to MWEs, if we suppose, following Sag et al. (2002), that MWEs 
are lexical units of more than one word, then in this chapter the analysis of the 
properties of MWEs will be restricted to those of the subset of phrasal vocabulary, 
i.e., MWEs having syntactic structure. The Law of Exceptions predicts that the 
lexicon of a human language will always contain an inventory of MWEs with 
grammatical structure since all languages have syntax. Such lexical items are in 
the mental lexicon because they have one or more idiosyncrasies, i.e. properties 
which cannot be predicted by the grammar of the language. That is why they are 
stored and retrieved rather than computed (Bresnan 1981).³ Such lexical units 
are elsewhere termed, amongst other things, ‘phrasemes’ (Mel'čuk 2012) or are 
a subclass of ‘morpheme equivalent units’ (Wray 2008). 

It follows that the structural properties of compound words will not be ex- 
amined. This is because opportunities for structural variation in compounds are 
relatively slight (Selkirk 1982) while in the subset of MWEs with syntactic struc- 
ture, opportunities for idiosyncratic variation of structural properties are consid- 
erable since the syntaxes of natural languages are complex and thus offer many 
opportunities for syntactic idiosyncrasy. 

MWEs have two kinds of properties: digital properties such as having an oblig- 
atory plural in some instances, and gradable (analogue) properties such as their 
degree of semantic compositionality. Analogue properties can be of two kinds: the 
MWE has a particular property to a greater or lesser degree or the MWE has the 
property some of the times it is uttered but not at other times. 

Viewed diachronically, MWEs may exhibit idiosyncratic properties that were 
not idiosyncratic at some time in the past. Some of the many English MWEs 
which are originally quotations from Shakespeare and the King James Bible 
translation, often termed ‘winged words’ in continental phraseological manuals 
(Gläser 1986), have such idiosyncrasies. They are not alone, as can be seen in (1). 


(1) will he nill he 
a. originally: will he ne will he 
will he not will he 
b. now truncated further to: willy-nilly⁴ 
'regardless of what one might wish' 


³That is not to say that they are unanalysable, as hybrid theories of speech production such as 
those of Cutting & Bock (1997), Titone & Connine (1999) and Sprenger et al. (2006) propose. 

⁴Such archaisms have been noted in the inventory of MWEs for sources as various as Homeric 
epic (Lord 1960) and livestock auctions (Kuiper & Haggo 1984). 

What general sources are there for the idiosyncratic properties of MWEs? They 
have three sources: 


1. properties they have by virtue of being lexical items; 
2. properties they have by virtue of being structurally complex; 
3. properties they have by virtue of being phrases.⁵ 


The subset of MWEs which I will examine, as I have indicated above, may be 
defined on the basis of their structural properties, namely that they have syn- 
tactic structure. They may have other properties which may cross-classify. Sag 
et al. (2002) see idiomaticity as definitional for MWE. I do not regard semantic 
non-compositionality as a definitional criterion for MWE since it is shared with 
derived words (Jackendoff 2002) and, although many MWEs are idioms, many 
are not (Mel'čuk 2012). In example (2), 


(2) infidelity 


has a narrowed sense of ‘marital infidelity’. 
Having associated conditions of use is also a cross-classifying property since 
mono-morphemic words may also have associated conditions of use as in 
example (3). 


(3) Thanks! 


The property of being a restricted collocation is common to compounds and 
idioms. In a compound, since both constituents are lexicalized they must be re- 
stricted against one another. 

That being the case, an MWE can be a restricted collocation, semantically 
non-compositional, and have associated conditions of use as in example (4). 


(4) I declare the meeting open. 


Example (4) is an MWE. It is a restricted collocation. Open has a somewhat 
specialized sense and the whole expression is a formula used by the chair of a 
meeting to begin the formal proceedings of a meeting. 

A classification of all lexical items on the basis of structural properties (which 
do not cross-classify) can be given as in Figure 1. 


⁵Here phrases are to be understood to include clauses and sentences, i.e. a sequence of words 
having syntactic structure. 
⁶This is also the approach used by Fiedler (2007). 


lexical items 
    structurally simple (e.g. cat) 
    structurally complex 
        word level complexity 
            derived words (e.g. decision) 
            compound words (e.g. lighthouse) 
        syntactically complex 
            phrasal lexical items (e.g. the White House) 


Figure 1: Structurally based classes of lexical items. 


2 Idiosyncratic properties of MWEs 


Many of the properties described below are described and exemplified for Ger- 
man by Burger (2010) and in Jaki (2014). 


2.1 Idiosyncratic properties of MWEs which they have by virtue of 
being lexical items 


Such properties are shared with structurally simple lexical items. 

All MWEs may have phonological idiosyncrasies. This is probably a digital 
property. For example, MWEs may be lexicalized with idiosyncratic phonetic 
realizations, e.g. obligatory truncation as in (5). As noted above, this is a Type 1 
idiosyncrasy. 


(5) She'll be right. 
    #She will be right. 


(6) Good day. 


Example (5) is an Australian MWE indicating that there is nothing to be concerned 
about, and (6) is an Australian English greeting conventionally realized as [gidei]. 

MWEs may also have idiosyncratic intonation contours, e.g. livestock auction 
formulae (Kuiper & Haggo 1984), market cries, and classroom greetings by 
elementary school children to their teacher in Australia and New Zealand, which go 
at half normal articulation speed and have a distinctive tune on the formula as 
in (7). 


(7) Good morning, Miss/Mrs/Mr X. 


Good morning, Mister Jones 


Figure 2: Primary school greeting formula tune. 


(8) Maori 
    Tihei mauri ora 
    sneeze spirit life 
    ‘the sneeze of life’ 


(8) is used by a speaker when taking the floor during Maori oratory. 

The formula has an intonationally raised and prosodically drawn out syllable 
on hei and a quicker than normal downward intonation contour on the remaining 
syllables of the phrase. This is also a Type 1 idiosyncrasy since such intonation 
contours are possible within the grammar. 

Any lexical item may have conventional conditions of use such as (9)-(11). 
This is a digital Type 1 property since the grammar has nothing to say about the 
usage conditions of lexical items. Such lexical items are often termed formulae 
or routine formulae (Coulmas 1979). 


(9) Sorry. 
a single word apology, 
(10) Bullshit! 
a compound word exclamation of disbelief, 
(11) If it please Your Honour. 


an MWE used by legal counsel seeking the approval of a presiding judge in a 
court for a course of action. 
An example from Maori is the formula in (12). 


(12) kapiti hono, tātai hono 
     join connect recite connect 


This is a bridging formula used to transition from the acknowledgement of the 
dead to greetings to the living in formal speechmaking. 

While I have designated this property as digital, i.e. an MWE either does or 
does not have specific conditions of use, the specific conditions of use are them- 
selves complex and not necessarily digital (Biber 1994). The contexts of use will 
also range from the general to the specific. 


2.2 Idiosyncratic properties MWEs may have by virtue of their being 
structurally complex 


Such properties are shared by structurally complex words. 

It is possible for an MWE to exhibit morphological and/or morphosyntactic 
idiosyncrasy. The extent to which this is possible depends on the inflectional and 
derivational morphology of a language.⁷ Languages with extensive inflectional 
morphology such as Turkish would be expected to exhibit inflectional 
idiosyncrasy in their MWEs. Chinese languages, in the absence of inflectional 
morphology, cannot. For example, in English an MWE may have an obligatory singular 
when a plural is semantically plausible as in (13) or an obligatory plural as in (14) 
when a singular is plausible. These are Type 1 idiosyncrasies. 


(13) give someone a hand 
'assist someone' 
#give someone a pair of hands 


(14) as scarce as hens' teeth 
     ‘very scarce’ 
     #as scarce as a hen's teeth 


MWEs may also have idiosyncratic derivational morphology. The MWE in (15) 
is a formula used of police arrests in particular. 


(15) the use of undue force 
     or: to use undue force 


The morphological idiosyncrasy is that there is no equivalent with due force 
(although there are no doubt situations where due force is applied). 


⁷Note that it is rare for left hand constituents of English compounds to have inflections even 
when this is warranted semantically as in head count ‘the counting of heads’. This restriction 
on the appearance of inflections appears to be a requirement of the word formation rules of 
English and thus not an idiosyncratic property of individual compounds. 

In Maori, the expression 


(16) ma-na (noa ake) te kore e V (o NP) 
     PREP-3SG (just up) DET NEG TAM V (PREP NP) 
     Lit. ‘(His/they, ...) not V-ing is (just) his fault.’ 
     ‘NP is certain to V’ 


occurs in the example sentence in (17). 


(17) Ma-na noa ake te kore e pai o tā tātou rārangi 
     PREP-3SG just up DET NEG TAM good of SG.A 1.PL.INC line 
     ‘Just because of him our lineout was no good.’⁸ 


This is from a piece of talk about rugby football in which lineouts are a set move.⁹ 


This MWE has the restriction that it always contains mana, with 3rd person 
singular agreement, whatever the number of the subject NP. This is a case where the 
VP has a frozen morphology and agreement does not operate across the 
boundary between the open slot of the subject and the number inflection of the verb. 
Since the general rules of agreement do not operate in this case, this is a Type 2 
idiosyncrasy. 

An MWE may also retain as an idiosyncratic property an inflection which is 
no longer available in the current language. The Dutch MWE in (18) 


(18) des duivels 
the.GEN devil.GEN 


‘to be very angry’ 


exhibits a genitive inflection on the article which is no longer current. This is a 
Type 2 idiosyncrasy since (18) exhibits a breach of the current rules of inflection. 

It is also possible that an individual word may have different morphosyntactic 
properties in an MWE than it does elsewhere. An anonymous reviewer has 
given the following example from French: "To cite from Grevisse (Bon Usage: 
198), ‘Orge est féminin, sauf dans les deux expressions orge mondé, orge perlé.’ 
So a word may be feminine, except in certain fixed expressions!" Here, presumably, 
agreement with the gender of the noun would be un-idiosyncratic. This is 
a Type 1 idiosyncrasy since the grammar of French makes two genders available 
and the assignment of gender to individual words is (in part) arbitrary. 


⁸Detailed glosses: Mana ‘for him/her’; noa ake ‘just/merely’; tā tātou = ‘ours (first person 
inclusive)’, i.e., ‘belonging to all of us’. 
⁹Source: http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10587508. 

All structurally complex lexical items can have bound forms as constituents, e.g. 
derived words such as agog and compound words such as wardrobe (Richter et al. 
2010). In MWEs the following examples show the presence of bound words:¹⁰ 


(19) Maori 
     māngere hōnia 
     lazy very 
     ‘very lazy’ 


(20) be on tenterhooks 
     ‘to be in a state of agitation about a future event’ 


(21) take umbrage at 
     ‘take offence at’ 


(22) kith and kin 
     ‘relatives’ 


Being a bound word is a Type 1 idiosyncrasy since no rule of the grammar is 
breached by the fact that a word is bound within an MWE. 


While this chapter will not specifically deal with the semantic idiosyncrasy of 
MWEs, it is useful to offer a few remarks on that property since it is shared with 
structurally complex lexical items such as derived and compound words. Semantic 
idiosyncrasy appears to be an analogue property.¹¹ Semantic idiosyncrasy can 
come about in a number of ways. As Jackendoff (1975) points out, many English 
derivational affixes are polysemous. In particular words, however, only one of 
the senses associated with the affix is part of the compositional reading of the 
word as a whole. Such selective compositionality occurs where not all the 
cross-product senses of affixes or words are part of the sense of the whole expression. 
In an MWE such as (23), the word stock does not have the sense of ‘liquid used for 
soups and sauces’ but ‘inventory’.¹² This is a Type 1 idiosyncrasy since such a 
reading is a possible reading but not the only possible one. 


(23) BE in stock 
     ‘be part of the current inventory of a shop or warehouse’ 


¹⁰In (19), hōnia is a bound form occurring only as a modifier of māngere ‘lazy’. 
¹¹See Burger (2010) for a useful introductory discussion. 

Non-compositionality can occur where the sense of a word in a lexical item 
does not occur when the word is used independently. This is not a matter of a 
breach of a grammatical or semantic rule. It would therefore be a Type 1 
idiosyncrasy. It is manifest in an MWE such as (24). 


(24) without let or hindrance 
     ‘without any obstruction or interruption’ 


The noun let has a now defunct syntactic category and sense.¹³ 


A second kind of non-compositionality occurs when the rules of the semantics 
of the language are breached as in conventional figurative expressions such 
as (25). These are Type 2 idiosyncrasies. In such cases structurally complex 
lexical items also share the potential property of having analysable semantic 
representations. In (25), the phrase is figurative with the gloss ‘accepting a difficult 
challenge or situation’, grasp being ‘accept’ and the nettle being ‘the difficult 
challenge or situation’. 


(25) grasp the nettle 


3 Idiosyncratic structural properties an MWE may have by 
virtue of being a phrase 


In this section, I focus on those areas of potential structural idiosyncrasy which 
MWEs have but which they do not share with other structurally complex lexical 
items such as derived and compound words. Each relevant area will show that 
the Law of Exceptions appears to be corroborated. 

All MWEs are associated with a phrase structural configuration. This is shown 
for one MWE in example (26), where the final NP is a slot (an open argument 
position). 


(26) [VP [V make] [NP [DET the] [N most] [PP [P of] [NP]]]] 
     ‘maximize the potential offered by ...’ 


¹²This selective compositionality may be the consequence of polysemy or homonymy. It can be 
difficult to separate these in particular instances. 
¹³It also has this sense in the term let in tennis, namely an obstruction. 


Having grammatical structure is a digital property. That is not to say that such 
structures are always permissible in the current synchronic grammar of the 
language, as (27) shows. 


(27) be that as it may 
     ‘whatever the actual case may be’ 


Example (27) is in the subjunctive mood, a mood that no longer exists in the 
current synchronic grammar of English.¹⁴ This is a Type 2 idiosyncrasy. 

In (28), the syntax is calqued from a Chinese four character idiom (Kuiper & 
Tan 1989). 


(28) long time no see 
     ‘I haven't seen you for some time’ 


It can therefore be concluded that there are exceptions to the phrase structural 
regularities of the synchronic grammar. This is also a Type 2 idiosyncrasy. 

The phrase structure of MWEs may be further constrained by general 
constraints that are lexicon internal in that not all the possible phrase structural 
configurations the grammar allows are to be found in MWEs. For example, 
grammars allow for recursive rules in their syntax. There is evidence of a degree of 
recursion in MWEs as in (29). 


(29) a sight for sore eyes 
'a welcome appearance' 


In (29), there are two NPs: one is the top NP while the other is embedded in 
a PP. However, recursion is limited because MWEs are stored in a finite brain 
and so cannot be of indefinite length. How limited recursion in MWEs may be is 
an open question.¹⁵ The Law of Exceptions is, however, corroborated as regards 
recursion in the grammar since this is a Type 2 idiosyncrasy, albeit a general one 
since it is not just the property of a single lexical item. 

O'Grady (1998) proposes that the citation form of idioms in the mental lexicon 
is in the form of lexical selection by heads of heads within their syntactic domains, 
thus forming chains of heads. Some of these requirements are interesting in that, 
while phrases must have heads, it is not a necessary property of MWEs that 
the head position should dominate a lexical item or, in the case of functional 
projections, a specific functional head. This constraint may itself have exceptions, 
as suggested by an example from Maori as in (30), a formula expressing sympathy 
for someone's problem. 


¹⁴This is essentially a morphosyntactic property included here as a structural property. 
¹⁵Hoeksema (2010) and Richter & Sailer (2009) discuss MWEs of clause length. 


(30) i wāna nei (hoki) 
     at/by/from his/her NEAR-SPEAKER EMPHATIC-PARTICLE 


While i is the head of the PP, O'Grady's theory predicts that there must be a 
lexical head for i to select within the immediate domain of the PP, i.e. the head 
of the NP complement of i. Hoki is not a lexical head and is optional (therefore 
cannot be a lexical head). Wāna ‘his/her’ can be replaced by an appropriate 
possessive determiner, e.g. wau ‘your’, wa rātou ‘their’. There are, however, 
restrictions on the choice of possessives. The possessive pronoun always starts with w, 
a possible initial phoneme for these possessives otherwise, but many speakers 
use it only here. Normally the possessive determiners have t- for singular 
possessum, Ø- for plural, thus e.g. tāna, āna. Some speakers allow w-, thus wāna for 
plural. However, all speakers use only the w-forms in this MWE, showing an 
idiosyncrasy typically associated with MWEs. Furthermore, the possessive (taking 
possessives to be determiners) has no NP complement. The determiner position 
is also always a possessive, i.e., i wōna nei cannot occur in this MWE. I wōna nei 
can certainly appear elsewhere given the right syntactic (etc.) environment, as in 
(31), just never in this MWE. 


(31) He nui ake oku waka i wōna nei. 
     A big upwards my cars than his here 
     ‘My cars are bigger than his.’ 


Nei is obligatory. While there are three locative particles in the morphosyntax
of Maori, one never finds either of the other two, nā ‘near hearer’ or rā ‘over
there’, in this MWE. So there are no lexical heads within the domain
of the head position of this MWE and the non-heads are idiosyncratic in various
ways. Thus, unless one uses an analysis allowing functional heads to serve in
O'Grady's head chain proposal, in which wāna etc. are determiners and thus
functional heads, this MWE has no lexical head within the immediate domain
of the head of the MWE. This case therefore suggests that a strong form of
O'Grady's proposal is falsified and the Law of Exceptions is corroborated.


Ray Harlow (personal communication) provided this example and analysis. A case might be 
made for functional heads as well as lexical heads being predicted to be lexicalized in the case 
of possessives where the possessive marker could be regarded as a functional head of DP and 
where the NP within the possessive phrase is a slot.

The syntax of a language may require certain obligatory constituents, e.g.
complements of transitive verbs or possessive NPs. In an MWE these may not be filled
with lexicalized material, as in (32) and (33). Where this is the case, the
idiosyncrasy is lexical and is not a breach of the rules of the syntax. It is an
idiosyncrasy because, while take is transitive, it is an idiosyncratic feature of the
MWE that the NP object of take in take NP to task is not lexically filled, as it is in
take notice of. Thus these are Type 1 idiosyncrasies.


(32) take NP to task 
‘hold someone responsible’ 


(33) get NP’s goat. 
‘annoy someone’ 


Such slots may be semantically restricted in idiosyncratic ways as in (34). 


(34) drop in on NP[+human] 
‘visit someone unannounced’


Some MWEs have an optional but lexicalized constituent as in (35). Again 
this does not involve a syntactic irregularity since the rules of the syntax allow 
both configurations. In other words, while drop can have both human and non- 
human objects, drop in on can only have a human object. So these are also Type 1 
idiosyncrasies. 


(35) (keep poss-NP) fingers crossed 
‘hope for a good outcome’ 


Such optional constituents are either truncations as in (35) or they can be
internal to the MWE as in (36).


(36) take (careful) note of NP 


In (36), careful is a highly preferred modifier and thus we can suppose that it 
is lexicalized but optional. Given that modifiers are permissible in general, this 
is a Type 1 idiosyncrasy. 

The distinction between slots and optional constituents is that slots are
lexically unspecified except for their syntactic category and are obligatory, whereas
optional constituents are lexically specified and optional. Modifiable MWEs, in
turn, are optionally able to take any appropriate modifier.
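
The three-way distinction can be made concrete with a small sketch. The class names `Fixed`, `Slot` and `OptionalFixed` and the function `render` are illustrative assumptions, not part of any existing lexicon format; the entries encode examples (32) and (36) above.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Fixed:
    """A lexically specified, obligatory constituent."""
    form: str


@dataclass(frozen=True)
class Slot:
    """An obligatory position specified only for syntactic category,
    optionally with a semantic restriction such as '+human'."""
    category: str
    restriction: Optional[str] = None


@dataclass(frozen=True)
class OptionalFixed:
    """A lexically specified constituent that may be omitted."""
    form: str


Constituent = Union[Fixed, Slot, OptionalFixed]


def render(entry: List[Constituent], fillers: List[str],
           include_optional: bool) -> str:
    """Spell out one use of an entry; `fillers` gives one string per Slot."""
    remaining = iter(fillers)
    words = []
    for constituent in entry:
        if isinstance(constituent, Fixed):
            words.append(constituent.form)
        elif isinstance(constituent, Slot):
            words.append(next(remaining))  # slots are obligatory
        elif include_optional:
            words.append(constituent.form)
    return " ".join(words)


# (32) take NP to task
TAKE_TO_TASK = [Fixed("take"), Slot("NP"), Fixed("to task")]
# (36) take (careful) note of NP
TAKE_NOTE_OF = [Fixed("take"), OptionalFixed("careful"),
                Fixed("note of"), Slot("NP")]

print(render(TAKE_TO_TASK, ["the minister"], include_optional=False))
print(render(TAKE_NOTE_OF, ["the warning"], include_optional=True))
print(render(TAKE_NOTE_OF, ["the warning"], include_optional=False))
```

Free modification, the third case, is deliberately left out of the entry itself: on the account above it is licensed by the grammar rather than stored with the MWE.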

Conversely, in some MWEs, internal modification having scope over an
internal constituent is not permitted, for example in the case of (37).

(37) cut no ice
‘have no impact’


This cannot be modified, as shown in (38), 
(38) #cut no melting ice 


when the grammar would otherwise permit it. This suggests that modifiability
properties can sometimes be absolute, as in cases where modification is
impossible, whereas for other MWEs modification may be a preferred option, with
some modifiers highly favoured and others disfavoured.

The presence of a lexicalized optional modifier in (36) is idiosyncratic but it is 
not a grammatical irregularity. It is therefore a Type 1 idiosyncrasy. The restric- 
tion against modification in (37) could, by contrast, be regarded as a grammatical 
irregularity, i.e. a Type 2 idiosyncrasy. That such cases should exist is a prediction 
of the Law of Exceptions. 

Where the syntax of a language allows a variety of related constructions for 
a similar argument structure, an MWE may only permit one or fewer than a full 
set of variations, e.g. double object constructions as in (39)-(42), passives as in 
(43) and (44). This is a Type 1 idiosyncrasy. 


(39) give NP the sack 

‘terminate NP's employment’ 
(40) #give the sack to NP 
(41) #pay something attention 


(42) pay attention to something 
‘note or concentrate on something’ 
(43) NP cut off his/her nose to spite his/her face 
‘act in a way that is detrimental to oneself out of pique’ 


(44) #John’s nose was cut off to spite his face.


The semantics of modifier constituents within MWEs is complex (Nicolas 1995). Nicolas
suggests that a modifier placed internally to an MWE can have scope over the meaning of the
whole expression. In (38) the internal modifier melting cannot be parsed as modifying ice.
However, there are other cases that Nicolas regards as essentially adverbial in having scope over
the metaphorical expression as a whole, so that cut no real ice is parsed as ‘really cut no ice’ and
cut no empirical ice is parsed as ‘cut no ice empirically’. So the modification, while placed
internally, is not semantically a modification of the constituent the modifier is predicted to modify
in a compositional syntax. These cases are thus idiosyncratic. In that sense the placement of
the modifier is structurally idiosyncratic.

For each set of syntactic alternates of this kind there may be MWEs which are
idiosyncratic in allowing only one of the two possibilities where the grammar
would predict that both might occur. For example, a double object construction
may be lexicalized in either one or the other form, as in (39)-(42), or in both, as
in (45) and (46).


(45) give credit to NP (for)
‘give positive acknowledgement to someone for something’


(46) give NP credit (for) 


Such distributions may well be matters of degree. 


In an MWE the antecedent of a pronominal or reflexive can be more restricted 
than the syntax of a language requires. Such cases can be seen as slots which 
are restricted to pronominals with additional slot restrictions as regards the an- 
tecedent of the pronominal. This is a Type 1 idiosyncrasy. 

In (47), the antecedents of the possessives must be the agent arguments of dig 
as in (47a) and (47b). 


(47) dig one's heels in 
‘resist’ 
a. Jane dug her heels in. 
b. # Jane dug Fred's heels in. 


An MWE can have an argument structure which is different from that of its head
verb, as in (48). This is again a Type 1 idiosyncrasy since the grammar allows for
predicates to have various argument structures.


(48) raining cats and dogs 
‘raining heavily’ 


Rain is a zero place predicate but in (48) it is apparently a one place predicate. 

MWEs merge into the constructions of Construction Grammar. They are of
two kinds: lexically motivated constructions, e.g. the let alone construction
(Fillmore et al. 1988), and syntactically motivated constructions, e.g. irreversible
binomials (Malkiel 1959). It is an open question whether the latter belong in the


Fraser (1970) hypothesizes that there is a hierarchy of frozenness in construction types while
Nunberg et al. (1994) propose that the degree of syntactic flexibility is related to the degree of
compositionality of the MWE.

This could be seen as a case of a lexicalized internal accusative such as one gets with It snowed
a blizzard.

phrasal lexicon or in the grammar. The former clearly do belong in the lexicon
given that they have lexical content.

As suggested above, the grammars of languages place limits on phraseological
variation. Typologically different languages will therefore be predicted to give
different ranges of idiosyncrasies for MWEs. One typological distinction sets apart
what are termed free word order languages, but are more accurately free
phrase order languages, such as Warlpiri and Latin. A question which is yet to
be answered about such languages is what the underlying form of the syntactic
representation of their MWEs might be. How flexible are their
MWEs given that the languages themselves have relatively free phrase order? In
turn, what idiosyncrasies might their MWEs display in the relevant areas of the
grammar?

What of languages with the typological character of so-called verb second
(V2) languages such as Dutch and German, where the canonical order, within a
generative framework, is I second in main clauses but I last in subordinate
clauses? German and Dutch phrasal dictionaries list VP idioms with the verb in VP
final position, although in main clauses the tensed verb will be in second position.
For example, a Dutch MWE as in (49) is verb last in subordinate clauses, as in
(50), but verb second in main clauses, as in (51).


(49) van NP houden 
from someone/something hold 


‘love someone/something’ 


(50) Ik dacht dat ik van mijn aapje houden wou.
I thought that I from my little monkey hold would


‘I thought that I would love my little monkey.’


(51) Ik hield van mijn aapje.
I held from my little monkey


‘I loved my little monkey.’


Verb second placement is also obligatory if the verb is the head of a VP MWE.


Ray Harlow (personal communication) indicates that no dictionary of phraseological units 
exists for Latin and Michael Walsh (personal communication) knows of no dictionaries of 
phrasal vocabulary for any aboriginal language. 

Michael Walsh and Maia Ponsonnet (personal communication) know of no studies that might
assist in answering these questions.

This phenomenon is also discussed by Schenk (1995), Nunberg et al. (1994) and Bargmann &
Sailer (2018 [this volume]).

(52) Ik dacht dat ik het gewoon uit mijn duim zuigen kon.
I thought that I it just out my thumb suck could


‘I thought I could just make it up.’


(53) Ik zoog het gewoon uit mijn duim.
I sucked it just out my thumb


‘I just made it up.’


What is the order in German and Dutch of the verb plus its complement for 
such MWEs in the mental lexicon? Is there an order at all or are there only de- 
pendencies? This is not a question of flexibility under movement. Verb second 
placement in main clauses is obligatory. Are there Type 2 idiosyncratic mani- 
festations of these regularities in the MWEs of German and Dutch? The Law of 
Exceptions predicts that there should be cases in Dutch and German of MWEs 
which have main clauses where the tensed verb is not in second position. 

Beyond the Law of Exceptions lies a further question as to the preponderance 
in the lexicon of a language of particular classes of exceptions. It is possible that 
different languages make different selective use of the parameters of variation 
noted above, e.g. some languages might have more bound words than others 
(Dobrovol’skij 1988). 


4 Conclusion 


The foregoing provides an outline of a set of structural properties of a grammar
which have the potential to have exceptions and thus give rise to idiosyncratic
structural properties of MWEs. It has been proposed that such exceptions are of
two kinds: those which are ‘basic irregularities’, i.e. which are in breach of the
rules of the grammar and which I have termed Type 2 idiosyncrasies, and
idiosyncrasies which are the result of arbitrary restrictions on lexical items where
the grammar makes no stipulation about such restrictions. Where such
idiosyncrasies appear in lexical items, these have been termed Type 1 idiosyncrasies.
Classifying idiosyncrasies on this basis also suggests that Type 1 idiosyncrasies
may be more common and diverse than Type 2 idiosyncrasies and that
the lexicon is not as full of seriously lawless inmates as the prison metaphor in
Di Sciullo & Williams (1987) suggests. Perhaps there is a maximum security wing
for Type 2 inmates and a less secure set of cell blocks for Type 1 inmates whose
deviance is by way of arbitrary restriction.


Chapter 6


Choosing features for classifying 
multiword expressions 


Éric Laporte 


Université Paris-Est, Laboratoire d'informatique Gaspard-Monge CNRS, France 


Multiword expressions (MWEs) are a heterogeneous set with a glaring need for 
classifications. Designing a satisfactory classification involves choosing features. 
In the case of MWEs, many features are a priori available. Not all features are 
equal in terms of how reliably MWEs can be assigned to classes. Accordingly, re- 
sulting classifications may be more or less fruitful for computational use. I outline 
an enhanced classification. In order to increase its suitability for many languages, 
I use previous works taking into account various languages. 


1 Introduction 


Multiword expressions range from idioms like put pen to paper, meaning ‘undertake
to write something’, to multiword terms like protein kinase, to support-verb
constructions like take a dip ‘bathe’, and other types. Due to such diversity, there
is a glaring need for classifications, if only for practical organization and for ne- 
cessities of communication within the research community. Forty years after the 
first published comprehensive classifications of sets of MWEs, the community 
has not reached a satisfactory consensus on large classes or on the most relevant 
features. One outline of a classification (Sag et al. 2002), based on Nunberg et al. 
(1994), is influential, but some classes are fuzzily defined. The community is
seeking to delineate the basic objects of the field. This uncertainty confuses computer
scientists’ main MWE-related activity, which is to recognise types of MWEs in
texts through statistical engineering: the community does not offer a consensual
definition of types of MWEs.¹

Classifications are a matter of features of the items to be classified. Which 
features should be used for classification, and therefore investigated in priority? 
Of course, linguistic relevance plays a prominent role in this selection, but my 
point in this paper is that many researchers overlook other important reasons 
for selecting or discarding some kinds of features. Some features are fuzzy and 
imprecise, that is, it is difficult to tell which MWEs have them. In resulting clas- 
sifications, assignment of MWEs to classes is less reliable than it could be, and 
this is detrimental to computational use. Other features are more clear-cut and 
potentially more useful, but have not made their way to computational-linguistic 
literature yet. Another requirement for a convenient classification is that its out- 
line be suitable for many languages. Accordingly, I use previous work taking into 
account various languages. 

In §2, I exemplify and discuss the notion of a fuzzy feature. In §3 and §4, I
investigate two connected topics: clusters of correlated features, and practical
problems of observation. §5 advocates the practice of checking
information against the lexicon. I outline an enhanced classification in §6.


2 Clear-cut or fuzzy features? 


2.1 Examples


Some features are more clear-cut than others. For example, some MWEs select a
preposition for a free slot/argument position,² as in have pity on:


(1) You could have pity on us. 


Nothing is totally definite in linguistics, but using on in this context is clearly 
appropriate. 

In contrast, the semantic weight of verbs is a much fuzzier feature that lies in 
a continuum. The verb have in (1) is deemed "light", whereas it has full semantic 
weight in (2): 


(2) They will have this machine in soon.


‘This machine will soon be available for sale in their store.’


¹ However, there is a relative consensus on the delimitation of MWEs themselves. At least, many
experts agree that this class includes collocations, multiword terms and support-verb
constructions. I will not address this issue in more detail for lack of space.

² Here, free means that the content of the slot, i.e. the noun phrase, is variable.

This is a basis to classify have pity in (1) as a support-verb construction, or
light-verb construction, and have in in (2) as a phrasal verb. But, in have a call ‘talk on
the phone’, or have a goal, or make a joke, intuition about the semantic weight of
the verb in these expressions remains unsettled or depends on whom you ask.³


2.2 Related work

2.2.1 Earlier work on clear-cut features


All the main features for present classifications had already been proposed by 
1995, so the historical background is worth reviewing. 

The first research works on MWEs with extended classificatory results define 
classes and subclasses with relatively clear-cut features. For instance, Labelle 
(1974)’s study of French support-verb constructions with avoir ‘have’ assigns a 
class to expressions with an argument position introduced by the preposition sur
‘on’, as in:


(3) Lyon a un avantage sur Marseille.


‘Lyon has an advantage over Marseille.’⁴


This kind of sharp distinction neatly separates classes. For example, avoir un
faible pour ‘have a taste for’ definitely does not select sur, since sequences like
(4) are rejected.⁵


(4) * J’ai un faible sur toi.
Lit. ‘I have a taste on you.’


The other features used to define Labelle's classes are similar. Many features 
come down to applying elementary syntactic operations, one at a time, and judg- 
ing the acceptability of the result, while watching out for unexpected meaning 
changes. The method used by Labelle, called Lexicon-Grammar (LG) by Guillet 
& La Fauci (1984), is briefly described by Gross (1994). It was applied to MWEs by 


³ Another fuzzy feature of MWEs is whether they belong to terminology. Protein kinase does,
smooth operator ‘persuasive person; manipulative person’ does not; but sore throat
‘inflammation of the throat’ is somewhere in between, since it is used by professionals but mainly to
communicate with non-professionals.

⁴ For examples not in English, I do not provide glosses because they would not be useful for the
reader. I provide a translation of the literal meaning when it is different from the non-literal
meaning.

⁵ Independently of that, avoir un avantage may also occur with other prepositions, maybe less
clearly selected: J’ai un avantage par rapport à toi. ‘I have an advantage as compared to you.’

other authors since then (Meunier 1977; Giry-Schneider 1978; Danlos 1980; Gross
1982; Freckleton 1985; Machonis 1985; Ranchhod 1990; etc., in English, Romance
languages, Greek, etc.). All prefer clear-cut features such as:


• parts of speech (multiword nouns; verbal, adverbial and adjectival idioms)


• applicable syntactic operations, including optionality vs. compulsoriness
of fixed constituents and free slots.


Some examples of clear-cut features are less likely to occur at the top of a
classification tree:


• phrase structure (e.g. number of fixed objects in a verbal idiom)


• number of free slots, their selected prepositions, restrictions on what may
fill them


• compulsory coreference relations (e.g. in think on one’s feet ‘improvise a
reaction quickly’, between the free subject and the possessive)


Nothing is totally definite in linguistics, but the implicit rationale behind pref- 
erence for clear-cut features is that it is unwise to place poorly understood fea- 
tures in a decision tree, especially at its top. 
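
To illustrate what placing clear-cut features in a decision tree can mean in practice, here is a minimal sketch that records a few of the features listed above for each entry and groups entries on two of them. The record layout, the function name `group_by_clear_cut_features` and the feature values assigned to the sample entries are assumptions made for illustration; they are not taken from the Lexicon-Grammar tables.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    expression: str
    part_of_speech: str                  # e.g. "verbal", "nominal"
    free_slots: int                      # number of free slots, subject included
    selected_preposition: Optional[str]  # preposition selected for a free slot
    compulsory_coreference: bool         # e.g. subject and possessive corefer


# Illustrative values only (assumed for this sketch).
ENTRIES = [
    Entry("have pity on NP", "verbal", 2, "on", False),
    Entry("avoir un avantage sur NP", "verbal", 2, "sur", False),
    Entry("think on one's feet", "verbal", 1, None, True),
    Entry("protein kinase", "nominal", 0, None, False),
]


def group_by_clear_cut_features(
    entries: List[Entry],
) -> Dict[Tuple[str, Optional[str]], List[str]]:
    """Group entries by part of speech first, then by selected preposition."""
    classes: Dict[Tuple[str, Optional[str]], List[str]] = {}
    for entry in entries:
        key = (entry.part_of_speech, entry.selected_preposition)
        classes.setdefault(key, []).append(entry.expression)
    return classes


for key, members in group_by_clear_cut_features(ENTRIES).items():
    print(key, members)
```

Because every feature value is a definite category or count, each entry lands in exactly one class; a graded feature such as semantic weight would first need an arbitrary threshold before it could be used in the same way.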


2.2.2 Earlier work on fuzzy features 


However, outside this LG trend, clear-cut features are readily mixed with fuzzier 
ones, even when defining large classes. The clear-cut features are essentially the 
same as above. The fuzzier ones often involve semantics or psycholinguistics.⁷

SEMANTIC WEIGHT is often used to define support-verb constructions, or light- 
verb constructions, such as have pity, have a goal, take a dip. This section will 
show that this definition relies entirely on fuzzy features, and so does the other 
naive definition. 


⁶ Fraser (1970: 39) also proposes a classification of verbal idioms based on applicable syntactic
operations. But he tests it on a sample of 131 idioms only (p. 40-41). In addition, he
hypothesizes entailments between operations: for instance, if an idiom accepts passivization, it would
also accept permutation of complements. His classification presupposes the entailments: when
some of them are wrong for an idiom, there is no class for it. This is the case for the French
idioms faire le jour sur (lit. make the day on) ‘shed light on’ and claquer la porte au nez de (lit.
slam the door to the nose of) ‘slam the door in the face of’, which accept passivization, but not
permutation of complements.

⁷ Wisely, current classifications of MWEs avoid using terminologicalness to define main classes.

Baldwin & Kim (2010: 276) define light-verb constructions by the fact that their
verb is “semantically bleached or ‘light’, in the sense that [its] contribution to the
meaning of the light-verb construction is relatively small in comparison with that
of the noun complement”, that is, semantically weak. This definition dates back
to Jespersen: “[s]uch everyday combinations as those illustrated in the following
paragraphs after have and similar ‘light’ verbs (...) are in accordance with the
general tendency of modern English to place an insignificant verb, to which the
marks of person and tense are attached, before the really important idea (...) I
really must have a good stare at her.” (Jespersen 1942: 117). But in many occurrences,
verbs are felt to lie somewhere in a spectrum of intermediate stages between
significant and insignificant. And, even though the feature is polar, it is not scalar:
there is no metric according to which it would be possible to measure how close
an item is to the poles of the range.

Alternative views of support-verb constructions have been proposed. One of 
them is in terms of predicate-argument structure: "nouns that have character- 
istics of predicates" (Gross 1981: 32, my translation; Cattell 1984). By predicate- 
argument structure, I mean the concept borrowed from logic by linguists, who 
initially applied it (Tesnière 1959) to sentences such as:


(5) The wire connects the device to the socket. 


In this analysis, the predicate-argument structure of (5) is ‘connect’(‘wire’,
‘device’, ‘socket’), where the predicate is ‘connect’ and the arguments are ‘wire’,
‘device’ and ‘socket’. The predicate does not necessarily match with a verb:


(6) Everyone took a look at our project. 


Analysing (6) as ‘take’(‘everyone’, ‘look’) or ‘take’(‘everyone’, ‘look’, ‘project’) is
not satisfactory, precisely because take is too weak to make sense as the core
of a predicate-argument structure. If you analyse (6) as ‘take look’(‘everyone’,
‘project’) instead, you consider that the predicate is take a look (or the noun look,
which makes little difference, since take has features of a mere function word).
Or, in other words, the noun look has valency two. On the basis of this type
of analysis, support-verb constructions could be defined as those in which the
predicate does not match with the main verb, but with a noun (a predicational
noun, or noun that has valency) or another part of speech (PoS). Unfortunately,
this definition still relies on a shaky semantic intuition: which part of a sentence
matches best with the intuition of predicate? Take the following sentence:


(7) He made a joke.


In (7), is the verb make to be analysed as a “performance” predicate, or is it so
light that the sentence is equivalent to He joked?

Another alternative is suggested by examples from Jespersen (1942), which 
all involve deverbal nouns such as stare, and by the pairs of sentences explicitly 
pointed out by Harris (1964: 17-19): 


(8) a. He took a look at it.
b. He looked at it. 


Could support-verb constructions be defined by the equivalence between the 
content verb (looked) and the support-verb followed by the deverbal noun (took 
a look)? This would be consistent with both previous definitions.⁸ Unfortunately,
the definition based on equivalence with a verb would exclude many expressions 
for which no equivalent verb is in use (Labelle 1974): 


(9) a. Il a eu un conflit avec sa famille.
‘He had a conflict with his family.’


b. * Il s’est conflité avec sa famille.
Lit. ‘He conflicted himself with his family.’


‘He conflicted with his family.’


This is not desirable because (9a) otherwise behaves like a typical support-verb 
construction. It is syntactically and semantically similar, for example, to (10a), for 
which an equivalent verb is observed: 


(10) a. Il a eu une réconciliation avec sa famille.
‘He had a reconciliation with his family.’


b. Il s’est réconcilié avec sa famille.
Lit. ‘He reconciled himself with his family.’


‘He was reconciled with his family.’

Here are parallel examples in English:


(11) a. He has the goal of getting rich.
b. * He goals to get rich.


⁸ As for the definition based on semantic weight, if look is equivalent to take a look, little is left
for take to contribute to the meaning. Now for the definition referring to predicate-argument
structure: if looked is the predicate in (8b), its equivalent took a look should logically be
considered as the predicate in (8a).

(12) a. He has the aim of getting rich.
b. He aims to get rich.


Reformulating, the property of equivalence with a content verb would not clas- 
sify (9a) and (11a) as support-verb constructions, in spite of their striking similar- 
ity with (10a) and (12a). 

Thus, we are left with the first two naive definitions of support-verb constructions:
one based on the semantic weight of the verb, and the other on predicate-argument
structure. Both definitions rely on particularly fuzzy semantic intuitions.
They situate the feature of being a support-verb construction on a continuum
between two poles. A more precise definition will be reported in §2.2.3.

Gibbs & Nayak (1989: 104) define another loose feature, SEMANTIC DECOMPOSABILITY,
as the “[contribution of] parts of idioms to their figurative interpretations
as a whole [according to] speakers’ assumptions”. For example, the parts
of pull strings ‘covertly use one’s influence on personal connections’ would be
pull ‘exploit’ and strings ‘personal connections’. This is a continuously graded
intuition: “People’s intuitions about the decomposability of any idiom can vary
along some continuum of semantic decomposition” (Gibbs & Nayak 1989: 67); “in
general, idiom phrases exist on a continuum of analyzability ranging from those
idioms that appear to be highly decomposable (e.g., pop the question) to those that
can be viewed as semantically nondecomposable (e.g., kick the bucket)” (Gibbs &
Nayak 1989: 107). Nunberg et al. (1994: 497, 508) rename this feature SEMANTIC
ANALYSABILITY and redefine it as the fact that the “idiomatic interpretation [can]
be distributed over [the] parts of the [expression]”. The wording is different, but
it comes to the same thing, since the only source for knowing the distribution of
the idiomatic interpretation over the parts of the expression is speakers’
assumptions.⁹ Nunberg et al. (1994: 520-523) cite an uncertain case: they represent take
advantage of with two lexical entries, one of which is semantically analysable
while the other is not, although they “know of no evidence that the two entries
might be semantically different”. This case, where the same idiom, in the same
sense, both is and is not analysable, implicitly situates it at some intermediate
stage. Although analysability is imprecise, Sag et al. (2002) and Baldwin & Kim
(2010: 270) adopt this feature, going back to the term SEMANTIC DECOMPOSABILITY,
to distinguish two of their major classes of MWEs: semi-fixed expressions
and syntactically-flexible expressions.


⁹ Nunberg et al. (1994: 496-497) contrast semantic analysability with transparency, which is
about speakers’ ability to guess why an expression with some literal meaning is used to convey
a given non-literal meaning.

¹⁰ Baldwin & Kim (2010: 270) equate their notion of decomposability with Nunberg et al. (1994:
496)'s semantic analysability.

Thus, reputed classifications use fuzzy features as liberally as clear-cut ones,
even at the top of their classification trees.


2.2.3 Clear-cut features and lexical inventorying 


There is something more to be learned from early work on MWEs: extensive 
practice of lexical description leads researchers to discover clear-cut features and 
adopt them in their classifications. 

Recall that Labelle (1974) and other LG authors cited in §2.2.1 prefer clear-cut
features. This specificity is connected to their practice of inventorying lexical
items: they delimit a set of phrases on the basis of features, systematically record
phrases belonging to the set, obtain comprehensive lists and study them in order
to reach well-documented conclusions. The papers and PhDs of these linguists
either include a comprehensive list of members of each class proposed, or at least
were published after the completion of such lists.¹¹ For example, Freckleton (1985)
lists 8000 English verbal idioms; by 1987, Gross’ laboratory¹² had studied 12700
entries of French predicational nouns used with support verbs (Tolone 2011: 144).

This labour-intensive method contrasts with the common practice of that time.
Nunberg et al. (1994: 498), for example, do not challenge the LG approach or the
resulting classifications,¹³ but base most of their research on sporadically picked
examples: the only sizable lists reproduced in their paper (p. 532-534) answer
empirically one of the many issues they address. When they claim that the
number of MWEs with anomalous morphosyntactic structure, like every which way
‘in all directions; in complete disorder’, is “not so small” (Nunberg et al. 1994: 515),
as a reply to Chomsky (1980: 149), who claims such expressions are not “typical”,
they do not compare numbers of lexical entries in comprehensive dictionaries.
(Tables of MWEs settle this dispute in favour of Chomsky: morphosyntactically
anomalous MWEs are really a small minority.)

Sag et al. (2002) and Baldwin & Kim (2010) share the same weaknesses. In
no language did any research group assign semantic (un)analysability to
comprehensive classes of MWEs. Bond et al. (2015: 64) encode this property in 421
English idioms, but this is a small sample, not a comprehensive lexical inventory;
in contrast, Grégoire's study of 5000 Dutch MWEs led her to give up categorizing
them as analysable or not (Grégoire 2010: 31-32).


¹¹ The lists still exist. They describe features for all entries and take the form of tables of features,
which are easy to use. Many of these tables are freely available, e.g. at http://infolingu.univ-mlv.fr/
for French. They remain to be disseminated so that they reach the mainstream community.

¹² Laboratoire d'automatique documentaire et linguistique (LADL), a part of Université Paris 7
and of CNRS.

¹³ Nunberg et al. (1994: 498)’s divergence from Machonis (1985) is terminological: they call
conventionality what Machonis (1985: 306) and Danlos & Gross (1988: 128-129) call lack of
compositionality, that is, the impossibility of predicting the meaning or use of the MWE on the basis of
only a knowledge of the rules that determine the meaning and use of its parts when they occur
separately.

But what is the connection between clear-cut features and extensive lexical 
description? When the description of a feature gives clear-cut results through- 
out the inventory of expressions, authors understandably tend to consider these 
results particularly reliable, and to prefer this feature over others, all else being 
equal. 

This is how Gross and his followers in the 1970s came across formal features 
of support-verb constructions which are still used as criteria to recognise them 
(Langer 2005). They systematically scanned the lexicon of French nouns, studied 
their syntactic constructions and worked out criteria for recognising predicational
nouns in dubious cases. One of these criteria is a formal property of determiners
and adjuncts (Gross 1976: 109) which is also observed in English. In (13a),
if possessive determiners, phrases with of and genitives are inserted around joke,
they cannot refer to anything other than the subject:


(13) a. He made a joke.
     b. He made his joke.
     c. * He made your joke.
     d. * He made Ann's joke.


How does this criterion correlate with the intuitive notion of predicate (cf.
§2.2.2)? In other sentences where the core of the intuition-identified predicate
is a noun, like (8a) He took a look at it or (11a) He has the goal of getting rich, 
this constraint is also observed. But when the intuitive predicate is a verb, the 
constraint is not observed: 


(14) a. He made your car.
     b. He made Ann's car.
(15) a. He reported your joke.
     b. He reported Ann's joke.


This formal test is a reason for analysing (13a) as ‘joke’(‘he’), but (14a) as
‘make’(‘he’, ‘car’) and (15b) as ‘report’(‘he’, ‘joke’(‘Ann’)), in spite of their apparently
similar structure.

This property, when used as a criterion to distinguish support-verb constructions
from full-verb constructions, gives more precise results than those I mentioned
in §2.2.2. It does not help with the distinction between support-verb constructions
and verbal idioms. Some verbal idioms behave like (13):

(16) a. He thought on his feet.
        ‘He improvised a reaction quickly.’
     b. * He thought on Ann's feet.


Others behave as (14) and (15): 


(17) a. He melted your heart.
        ‘He made you feel sympathy.’
     b. He melted Ann's heart.


But a similar property (Gross 1979: 865-866, Footnote 6), which is used in a forthcoming
compendium on French grammar (Abeillé & Vivés 2011: 16), helps to
sharpen the distinction between support-verb constructions and
verbal idioms. Take the following construction with the support verb make:


(18) The quake made damage to the area. 


A syntactic operation applied to (19a) produces a variant (19b) where make is 
absent: 


(19) a. The damage to the area made by the quake (is described in the diary). 
b. The damage to the area by the quake (is described in the diary). 


This criterion classifies have the back of, meaning ‘back, support’, as a verbal 
idiom, not as a support-verb construction, because it has no variant in which 
have would be absent. Take the following sentence: 


(20) The president has the back of our children. 


A banal syntactic operation on (20) would produce the subject of the following 
sentence: 


(21) * The president's back of our children (is manifested by real actions). 


But (21) is not in use.

This criterion applies to all support-verb constructions. The syntactic operations
that remove the support verb are not the same for all support-verb constructions,
even in a given language: (19) exemplifies one of them for English,
(20)-(21) another. Applying the criterion may involve knowing all operations
and testing them, because the criterion rules out the support-verb construction
analysis only if none applies, e.g. for have the back of.¹⁴ In Italian (De Angelis
1989), Portuguese (Ranchhod 1990, Rassi et al. 2014), Korean (Han 2000), Greek
(Kyriacopoulou & Sfetsiou 2003) and other languages, LG authors selected analogous
technical criteria and definitions.

The larger linguistic and NLP community was not receptive to LG in the 1980s 
and 1990s, and access to publications was difficult. Cattell (1984), though he did 
not explicitly challenge the LG syntactic criteria, stuck to definitions based on 
semantic intuition, and so did many linguists. Since then, both traditions, that is 
intuitive vs. technical definitions of support-verb constructions, have continued 
in parallel.¹⁵ Thus, support-verb constructions provide an example of a feature
that can be defined either as clear-cut or as fuzzy. The notion itself is basically 
the same, but some definitions ensure more definite membership than others. 

This review of earlier work showed that researchers engaged in projects of 
extensive lexical description tend to discover more clear-cut features and adopt 
them in their classifications, but that recent literature does not distinguish
between clear-cut and fuzzy features. Recent classifications are derived indifferently
from both.


14 This criterion also rules out the support-verb construction analysis for take advantage of:

(i) The bank takes advantage of deposit slip errors.

The syntactic operation of (19) does not apply:

(ii) The advantage taken by the bank of deposit slip errors (has been revealed).

(iii) * The advantage by the bank of deposit slip errors (has been revealed).

Neither does that of (20)-(21):

(iv) The advantage the bank takes of deposit slip errors (has been revealed).

(v) * The bank's advantage of deposit slip errors (has been revealed).


15 The use of light verb is loosely correlated with the intuitive approach and support verb with
the technical approach. Gross (1981: 12) adopts the term support verb, with the idea that, for
example, make "supports" a predicational noun in (13a).


2.3 Discussion


I could not find in the literature any discussion of whether the use of these fuzzy 
features is an issue at all, or if their relevance compensates for their drawbacks. 
The NLP literature pays no more attention to this question than corpus linguistics or the
philological or generative traditions.

Fuzzy features can be less technical. For example, as opposed to the defini- 
tion of support-verb constructions based on the semantic weight of the verb, a 
concern for definiteness leads one to adopt formal criteria that involve applying 
detailed syntactic operations, as in (13c) *He made your joke, and assessing the 
result. It is not just that this complexity can be seen as a drawback: it also makes 
precise features more likely to be language-dependent. The criteria illustrated by 
(13) for English need to be adapted when they are applied to, say, Italian or Ko- 
rean, due to differences in syntactic constructions. Decisive criteria for support- 
verb constructions have been found in many languages, but they are not exactly 
the same. Butt (2010) also draws this conclusion on the basis of a review of ty- 
pological and diachronic literature. In contrast, the fuzzy definition based on the 
semantic weight of the verb is language-independent. 

But the price to be paid for language independence is that you cannot tell if 
an item satisfies the definition or not. Then, to which class does it belong? In 
practice, in order to avoid fuzzy membership of classes, and uncertain inclusions 
between classes, fuzzy features must be replaced by clear-cut, binary models of 
them. 

Proponents of fuzzy features rarely claim these might be useful for computer 
applications. Semantic analysability/decomposability, for instance, is relevant 
to the mental lexicon, maybe to first language acquisition, but probably not to 
computational applications. If, in the future, information is attached to parts of 
idioms in formal semantic representations, can it be exploited computationally? 
Nunberg et al. (1994: 501) cite a variation on tilt at windmills ‘fight against imaginary
or invincible opponents’:


(22) tilting at the federal windmill 


One can imagine that a parser that would handle a semantic representation of 
tilt at windmills composed of separate structures for the parts tilt and windmill 
might interpret (22) more correctly than a parser with an atomic semantic rep- 
resentation of the idiom. But, as of 2018, both types of parsers are hypothetical, 
since few parsers interpret idioms. The challenge of identifying even their least
creative variants, such as tilt bravely at windmills, is probably a priority compared
to that of unlexicalized, playful variations like tilt at a federal windmill.

Besides this remote perspective, computational applications of analysability are
elusive. Bond et al. (2015) do not say that the analysability encoded in the dictionary
is used by the English HPSG grammar to check restrictions on the form of
idioms.

No clear use has been found either for another fuzzy feature, which consists 
in the fact that native speakers that don't know the MWE can guess its sense or 
not when they hear it in an uninformative context. For example, according to Os- 
herson & Fellbaum (2010: 3), the sense of rest on one's laurels 'keep from making 
effort out of self-satisfaction from prior achievements' is easily guessed, but that 
of walk on eggshells ‘be very cautious’ is not. This feature is relatively fuzzy: 
"the classification suggested above is only approximate, a number of idioms (...) 
cannot be straightforwardly fit into any one category"; they do not claim any 
computational use of this feature, and it is hard to imagine a realistic one. 

In contrast, many clear-cut features are straightforwardly exploitable in lan- 
guage processing, especially those that directly determine the possibility of oc- 
currence of actual forms, such as selected prepositions or applicability of syn- 
tactic operations. Clear-cut features of MWEs, as described in LGs, have been 
used early and recently in Tree-Adjoining Grammars (Abeillé 1988), symbolic 
machine translation (Danlos 1992), finite-state parsers (Senellart 1998), symbolic 
dependency parsers (Tolone 2011) and statistical parsers (Constant et al. 2013). 

Clear-cut features have significant methodological and practical advantages. 
When fuzzy features are used instead, it is worth checking carefully that their 
linguistic relevance motivates this choice. 


3 Correlated features 


3.1 An example 


In some cases, it is easy to replace a single fuzzy feature with a bundle of clear-cut 
ones. For example, the syntactic operations applicable to verbal idioms, instead 


16 This feature is the more restrictive of the two features that Nunberg et al. (1994: 495) call
conventionality. Osherson & Fellbaum term it non-compositionality, but it is quite different
from lack of compositionality (cf. Footnote 13) in Danlos & Gross (1988: 128-129): when speakers
that don't know rest on one's laurels figure out what it means, they don't base their guesses
only on a knowledge of the rules that determine the meaning and use of the parts when they
occur separately, but also on their imagination and cultural familiarity with classical antiquity.
Nunberg et al. (1994: 496-497)'s transparency (speakers' ability to guess why an expression
with some literal meaning is used to convey a given non-literal meaning) is less restrictive:
even if speakers can't figure out the non-literal meaning of an unknown idiom, they may be
able to guess the motivation of this meaning once they are informed of it.

of being collectively considered as a single (fuzzy) feature, are better dealt with
separately from one another. 
Here are a few examples of syntactic operations. 


* Optionality of fixed constituents:

(23) a. John bearded the lion in his den.
        ‘John faced the danger directly.’
     b. John bearded the lion.

* Optionality of free slots:

(24) a. John bears comparison to Magritte.
        ‘John is similar enough to Magritte to be likened to him.’
     b. John bears comparison.

* Insertion of free adjuncts:

(25) a. This dealt a blow to my hopes.
     b. This dealt a strong blow to my hopes.

* Topicalisation:

(26) a. He would not deal such a blow to you.
     b. Such a blow, he would not deal to you.

* Dative shift:

(27) a. This dealt a blow to my hopes.
     b. This dealt my hopes a blow.

* Reduction in a repeated occurrence:

(28) a. I changed my mind about China.
     b. You changed yours about India.

* Pseudocleft construction:

(29) a. John fights fire with fire.
        ‘John uses the same arms as his opponents.’
     b. The way John fights fire is with fire.

* Passivization:

(30) a. The price of the coffee caught John short of change.
        ‘Given the price of the coffee, John had no change.’
     b. John was caught short of change by the price of the coffee.


Not all operations are applicable to the same idioms. For example, bear com- 
parison to admits the removal of its prepositional free slot, but not passivization: 


(31) * Comparison to Magritte is borne by John. 


Fraser (1970: 34) is aware of these differences between features. The straightfor- 
ward model for syntactic flexibility is a multidimensional space of variation, since 
there are independent features. Nunberg et al. (1994: 509) claim that some large 
range of syntactic operations is "loosely correlated". Contrasts such as (24a) vs. 
(31) show that such correlation, if it exists, is not 100%. Since correlation is a statis- 
tical notion, only statistical evidence could support Nunberg et al. (1994)'s claim. 
This could be done by measuring the correlation with extensive lexical data. For 
example, Freckleton (1985)'s table C1P2, which contains beard the lion in his den, 
shows a positive but loose correlation between optionality of the prepositional 
object and passivizability: the Pearson's correlation coefficient, computed with 
these data, is 0.13. 
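
As a rough illustration of the kind of measurement meant here, the sketch below computes Pearson's coefficient between two binary features recorded for a set of idioms. The function name `pearson` and the eight-row toy table are assumptions made for this example; the values are invented and are not Freckleton's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r for two equal-length numeric sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("a constant feature has no defined correlation")
    return cov / sqrt(vx * vy)


# Hypothetical toy table: one position per idiom,
# 1 = the operation applies, 0 = it does not.
# Invented for illustration; NOT taken from Freckleton (1985).
optional_pp = [1, 1, 0, 0, 1, 0, 1, 0]
passive = [1, 0, 0, 1, 1, 0, 0, 0]

r = pearson(optional_pp, passive)
print(round(r, 2))
```

With 0/1 codings, this coefficient is the same quantity as the phi coefficient of the corresponding 2×2 contingency table, so a value near 0 means that knowing one feature says little about the other.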

Sag et al. (2002: 6), Baldwin & Kim (2010: 278) cluster all syntactic operations 
into SYNTACTIC FLEXIBILITY. Syntactic flexibility is an imprecise feature, since idioms
that undergo only some of the known syntactic operations are intermediate
cases, "syntactically flexible to some degree", as Sag et al. (2002) put it. But they 
do not measure the correlation either. Even so, both Sag et al. (2002: 3) and Bald- 
win & Kim (2010: 279) derive some of their major classes from “the” feature of 
syntactic flexibility. Baldwin & Kim (2010) subdivide verbal idioms into two sub- 
classes: one of non-decomposable idioms “with hard restrictions on word order 
and composition", i.e. no application of syntactic operations, and another of de- 
composable, syntactically flexible idioms. They leave open the question of where 
intermediate cases should belong in practice: when some syntactic operations
apply and others don't, they cross-classify, i.e. they assign the same lexical item
to different classes. None of these authors shows the benefit for NLP or linguis- 
tics of using a unidimensional scale as a model for a multidimensional variation 
space. 

These models artificially create fuzzy features and the associated problems. 
Their authors do not explain what motivates this innovation, nor do they state a 
position on previous classifications that avoid equating distinct syntactic opera- 
tions. 


3.2 Discussion 


Syntactic flexibility is a cluster of loosely correlated features and should not be 
used as if it were a single feature, especially when defining major classes: such 
definitions are imprecise. If a lexical database registers bear comparison to in a 
class of syntactically flexible idioms, this signals that this idiom admits at least 
some syntactic operations, but users cannot be certain about any specific op- 
eration, for example the removal of its prepositional free slot: (24b) John bears 
comparison. Conversely, if it is in a class of non-syntactically-flexible idioms, it
does not admit all the possible operations, but users cannot safely deduce that it
does not admit, say, passivization: (31) *Comparison to Magritte is borne by John.
This compromises computational usage: a major function of a classification is to 
ensure that the members of each class have the corresponding defining features. 

Thus, as long as all properties are not securely established for all entries, it is 
a good practice to specify each criterion accurately. This leads to individuating a 
number of features, and to specifying which entries have which features, as in
LG tables, which show that syntactic operations are not visibly more correlated 
in French (Gross 1982), Italian (Vietri 2011) or Greek (Fotopoulou 1993) than in 
English (Freckleton 1985). 
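
The contrast between individuated features and an aggregate class can be sketched as follows. This is only a minimal illustration: the column names, the `admits` and `is_flexible` helpers and the three-valued coding are assumptions of mine, not the actual format of LG tables, and the only values filled in are those stated in examples (24), (25), (27), (30) and (31).

```python
from typing import Dict, Optional

# True  = the operation is shown above to apply,
# False = the operation is shown above not to apply,
# None  = not stated in the examples (left unknown rather than guessed).
TABLE: Dict[str, Dict[str, Optional[bool]]] = {
    "bear comparison to": {
        "free_slot_optional": True,   # (24b)
        "adjunct_insertion": None,
        "dative_shift": None,
        "passive": False,             # (31)
    },
    "deal a blow to": {
        "free_slot_optional": None,
        "adjunct_insertion": True,    # (25b)
        "dative_shift": True,         # (27b)
        "passive": None,
    },
    "catch short": {
        "free_slot_optional": None,
        "adjunct_insertion": None,
        "dative_shift": None,
        "passive": True,              # (30b)
    },
}


def admits(entry: str, feature: str) -> Optional[bool]:
    """Look up one individuated feature for one entry."""
    return TABLE[entry][feature]


def is_flexible(entry: str) -> bool:
    """Aggregate label: at least one operation is known to apply."""
    return any(value is True for value in TABLE[entry].values())


for entry in TABLE:
    print(entry, is_flexible(entry), admits(entry, "passive"))
```

Both bear comparison to and catch short come out as "flexible" under the aggregate label, yet they differ on passivization, which is exactly the information a parser would need. The per-feature table keeps it; the aggregate label discards it.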

A correlation between features may give a sense that they are particularly 
relevant to classification, because they might stem from a hidden, underlying, 
fundamental property. Moreover, if these features are used jointly for classifica- 
tion, assigning an entry to a class will implicitly specify all features at the same 
time, killing two birds with one stone. However, the temptation should be re- 
sisted until a systematic investigation assesses how correlated the features are. 
If intuition overestimates the degree of correlation, which often happens, and if 
a classification equates loosely correlated features, assigning an entry to a class 
does not specify any of the features, failing to kill any of the proverbial birds. 

Fundamental scientific progress has often been achieved by elaborating a distinction
between two notions that are easy to confuse, e.g. weight and mass in
physics. In their model, Sag et al. (2002) and Baldwin & Kim (2010) do the reverse:
they replace a set of relatively precise features, which are objectively distinct and 
had been treated as such before, with an imprecise one, which is therefore more 
difficult to handle. Their model adds artificial uncertainty. 

Thus, merging a cluster of loosely correlated features into an "aggregate" fea- 
ture decreases the accuracy of the model and weakens its information content. 


4 Reproducibility of observation of features 


4.1 Examples 


Even without any excess of optimism about correlations, some features are more 
clear-cut than others for another reason: reproducibility. Reproducible observa- 
tions are those inherently susceptible to high inter-judge agreement. This notion 
may sound technical and is often ignored, but my point in this part is that it has 
considerable practical significance for projects with realistic goals. 

Features are not equal in terms of inter-judge agreement. For example, the 
compulsoriness of a coreference relation in an MWE, e.g. in think on one's feet
‘improvise a reaction quickly’, between the subject and the possessive, is judged
by checking the grammaticality or acceptability¹⁷ of some sentences, which is
relatively factual, as in (16) cited here as (32):


(32) a. He thought on his feet. 
b. * He thought on Ann's feet. 


With melt someone's heart ‘make someone feel sympathy’, such coreference is 
not compulsory, as shown in (17a) He melted your heart. Few native speakers 
will disagree with such observations. Consequently, this feature can be recorded 
in a lexical database in a relatively reliable way. Similarly, the applicability of 
syntactic operations to MWEs is tested by applying the operations, as in (16), (17) 
and (23)-(31), while judging the acceptability of the result and the conservation 
of the meaning. Therefore, in most cases, it is also reproducibly observable.


17 By grammatical we mean that a sentence may be used to convey some information in some
situation and in some context. This is consistent with how grammatical is used by most linguists,
and identical to what Harris (1957: 293) means by acceptable (Ross 1979: 161). In fact,
we will use acceptable, to avoid confusion with Chomsky's use of grammatical (Chomsky 1957:
15), which is in principle divergent since some nonsense sequences may be grammatical in his
sense.

In contrast, semantic analysability/decomposability in the sense of Gibbs &
Nayak (1989), Nunberg et al. (1994: 508), Sag et al. (2002) and Baldwin & Kim
(2010) has no other empirical ground than pure semantic impressions. “There
are no well-defined procedures for specifying whether a given idiom is semantically
decomposable or not” (Gibbs & Nayak 1989: 106). Nunberg et al. (1994:
523-524)'s intuitions about the analysability of take advantage of are unstable
(cf. §2.2). They cite take stock ‘take the time to think’ as analysable and take hold
‘grasp’ as non-analysable, but other speakers’ introspection is not sure to reproduce
this contrast. According to them, take stock “can be roughly paraphrased
as ‘make an assessment’, with the noun stock semantically approximating
‘assessment’.” But in the same vein, take hold can be paraphrased as ‘voluntarily
acquire a grasp’, where take would denote the ‘voluntary acquiring’ and hold the
‘grasp’. Acceptability is judged by introspection too, but is more factual and can
be backed by corpus attestations in some cases. Joint use of introspection and
corpus attestation is more and more recognized as a valid source of empirical
data on acceptability (Johansson 1991: 313;¹⁸ Fillmore 1992: 58; 2001: 1;¹⁹ McEnery
& Wilson 1996: 16;²⁰ Kepser & Reis 2005;²¹ Gries 2011: 872²²). Roll one's eyes, in
its idiomatic meaning, denotes ‘a feeling of surprise and rejection for something
stupid or strange’, and often also an actual eye movement that expresses this
feeling, but this physical element of meaning is perhaps not necessarily present
in all occurrences of the idiom:


18 “The corpus remains one of the linguist's tools, to be used together with introspection and
elicitation techniques. Wise linguists, like experienced craftsmen, sharpen their tools and recognize
their appropriate uses.”

19 “One [cannot have] success in the language business without using both resources: any corpus
offers riches that introspecting linguists will never come upon if left to their meditations; and
at the same time, every native speaker has solid knowledge about facts of their language that
no amount of corpus evidence, taken by itself, could support or contradict.”

20 “Why move from one extreme of only natural data to another of only artificial data? Both
have known weaknesses. Why not use a combination of both, and rely on the strengths of
each to the exclusion of their weaknesses? A corpus and an introspection-based approach to
linguistics are not mutually exclusive. In a very real sense they can be gainfully viewed as
being complementary.”

21 “It is one of the main aims of this volume to overcome the corpus data versus introspective
data opposition and to argue for a view that values and employs different types of linguistic
evidence each in their own right.”

22 “It is obvious that corpus linguists need to make subjective decisions all the time, and they
need to document their subjective choices very clearly in their publications. However, in spite
of these undoubtedly subjective decisions, many advantages over armchair linguistics remain:
the data points that are coded are not made-up, their frequency distributions are based on
natural data, and these data points force us to include inconvenient or highly unlikely examples
that armchair linguists may ‘overlook’.”

(33) We've all rolled our eyes at a particularly catchy headline.


Many idioms share this property (Burger 1998: 44). In their meaning, the feel- 
ing part is rather non-analysable, whereas the physical movement part is rather 
analysable. When compromising between these intuitions, not all speakers are 
likely to obtain the same result. It is not just that the semantic analysis of the feel- 
ing part is funny: more importantly, there is no reason why different observers 
would assign the idiom to the same class. 

The contrast between more or less reproducibly observable features is also 
observed in French, in Italian, and presumably in any language. Take this French 
idiom: 


(34) se mettre le doigt dans l'œil
     Lit. put one's finger in one's eye
     ‘have a mistaken understanding’


My own impressions in terms of analysability are precarious: does le doigt (lit. 
one's finger) really stand for an element of meaning like ‘understanding(x)’, met- 
tre (lit. put) for ‘choose’ and dans l'œil (lit. in one's eye) for ‘wrong’? 

Semantic analysability poses recurrent problems of reproducibility of observa- 
tion. This makes it a fuzzy feature. 


4.2 Related work 


Reproducibility of observation is not a new requirement. It is a central concern 
for American structuralists such as Bloomfield and Harris, who typically improve 
it by adjusting the definition of features under analysis, and in particular, by resorting
to FORMAL OR SYNTACTIC CRITERIA, as in (16b) *He thought on Ann's feet,
to avoid relying directly on pure semantic intuition. This tradition focuses on
selecting knowledge that can be reproducibly observed, as part of a quest for scientificity
in linguistics. In the observation of semantic features, DIFFERENTIAL
SEMANTIC EVALUATION is more reproducible than absolute semantic evaluation
(Gross 1975: 391-392).²³ For example, take the following French support-verb
construction:


23 “Pairs of sentences that are candidates for being related by a transformation are judged to be
synonymous or not. Thus, meaning is only involved in comparisons, and differences in meaning
are detected in this manner. In the physical sciences, it is well-known that absolute evaluations
of a variable (e.g. temperature) lead always to rather crude results, when compared to differential
evaluations of the same variable. The situation appears to be the same in linguistics with
respect to meaning. Attributing absolute terms to forms is quite problematic, and anyway, has
proved to be rather unsuccessful, while comparing the meanings of similar forms may bring
to light subtle differences that may be hard to detect directly.”




Éric Laporte 


(35) Le mur a de la couleur.
     Lit. The wall has colour.
     ‘The wall is colourful.’


Far from all interviewed speakers agree that (35) denotes intensiveness; in other 
words, this observation is poorly reproducible. Now take the following variant:


(36) Le mur a une couleur.
     ‘The wall has a colour on it.’


When asked if (35) is more intensive than (36), many more speakers share this perception,
agreeing that (36) is more neutral. The differential observation is more
reproducible than the absolute one.²⁴ Reproducibility decreases again if you compare
phrases, for example de la couleur vs. une couleur, instead of complete sentences.
The LG method is largely about such practical techniques of elaborating
the procedures of observation or the definition of features, in order to improve reproducibility.
When you ask the right question, it is easier to agree on an answer.
In practice, practitioners of LG work are trained to be systematically watchful of
their own dubious or unstable judgments, and to compare these judgments to
those of their peers. This measurement of reproducibility is subjective, but
peer-controlled, so that subjectivity does not affect the quality of the results.
It is performed right from the beginning of the project, and separately for each 
feature, to detect which features raise reproducibility problems. Such detection 
leads to two types of decisions: 


(i) give up the study of "bad" features, that is those that cannot be observed 
with reasonable reproducibility 


(ii) look for “good” features for which methodological precautions ensure rea- 
sonable reproducibility 


A reproducibility issue often causes a shift from an intuition-defined feature to 
one or several new criterion-defined ones. The latter may be a little different, but 
they have an advantage: it is clearer what they are. Such decisions refine or shift 
the target of the description and, of course, eventually affect the classifications 
based on the features. 

There are few ongoing debates about such practices in linguistics, and even
fewer in research on MWEs. Reproducibility is alien to Baldwin & Kim (2010)'s


24 Here are two other examples of definite semantic differences: (35) denotes more of a favourable
subjective judgment than (36), and (35) may evoke one or several colours, whereas (36) evokes
one.




6 Choosing features for classifying multiword expressions 


concerns, except for an allusion in connection with interpretation of nominal
compounds (Baldwin & Kim 2010: 275).²⁵ In current practices beyond LG, measurement
of reproducibility is more objective: it takes the form of inter-judge
agreement statistics. But these statistics either focus on a small sample of features
deemed representative, or handle features collectively, not individually (Palmer
et al. 2005: 86 is an exception): in both cases, they don't help to tell the “good”
features from the “bad”. The inter-judge agreement approach tests how a team of
descriptive linguists fares as regards reproducibility; it does not assess the potential
of each feature. It views reproducibility as a behavioral problem only, not as
a syntactic or lexicological problem, and disregards the fact that the problem is
different for each feature.
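
To make the difference between collective and individual handling concrete, here is a minimal sketch that computes Cohen's kappa separately for each feature and then once over the pooled judgments. Cohen's kappa is one common agreement statistic; it is my choice for the illustration and is not prescribed by the works cited. The feature names and the two judges' labels are invented.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two judges' labels over the same items."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments (judge 1, judge 2) on eight MWEs per feature.
judgments = {
    "obligatory_coreference": ([1, 0, 1, 1, 0, 0, 1, 0],
                               [1, 0, 1, 1, 0, 0, 1, 0]),
    "semantic_analysability": ([1, 0, 1, 1, 0, 0, 1, 0],
                               [1, 1, 0, 1, 0, 1, 0, 0]),
}

# Individual handling: one agreement score per feature.
per_feature = {f: cohen_kappa(a, b) for f, (a, b) in judgments.items()}

# Collective handling: all judgments pooled into a single score.
pooled_a = [x for a, _ in judgments.values() for x in a]
pooled_b = [y for _, b in judgments.values() for y in b]
pooled = cohen_kappa(pooled_a, pooled_b)

print(per_feature, pooled)
```

In this invented data set, the pooled score is moderate, while the per-feature scores show one feature with full agreement and one with agreement at chance level. Only the per-feature figures tell the “good” feature from the “bad” one.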

In addition, current practices usually take into account only small samples
of prototypical MWEs deemed representative.²⁶ But many reproducibility issues
stem from the diversity of lexical entries: their detection requires comprehensive
scrutiny of the lexicon.²⁷

Globally, with the shift from subjective to objective procedures, quality of mea- 
surement has deteriorated. It is worse than that: now, reproducibility assessment 
is rarely used for feedback on the aims of the description or on its practical pro- 
cedures. First, such feedback would require differential assessment on individual 
features. Second, inter-judge agreement is usually computed at the end of the 
descriptive phase (Meyers et al. (2004: 803) is an exception), when it is too late 
for feedback. Reproducibility assessment is only regarded as a quality indicator: 
researchers are content with measuring the symptoms and rarely attempt to cure
them.²⁸


25 Anyway, in the case of analysability, no such improvements of the definition seem to be at
hand.

26 Gibbs et al. (1989: 60) assess the consistency of undergraduates’ judgments of semantic
decomposability of a sample of 36 idioms.

27 The complexity of the assessment of reproducibility has three dimensions: the number of features,
the number of lexical entries and the number of judges. Informal LG practices deal with
all three dimensions. But objective measurement of inter-judge agreement is costly, which
leads to limiting its ambition in terms of two of the three dimensions: the number of features
and of lexical entries. Thus, the operation loses its essential benefits.

28 Another way of improving the resources has emerged: automating error detection targeted at
specific error types in resources. For example, Meyers et al. (2004) automatically check formal
and heuristic properties of dictionary entries; Cohen et al. (2011) check a constraint that the
occurrences of dictionary entries in a corpus are supposed to fulfill. These a posteriori checks
are celebrated as contributing to "quality assurance" and to "the development of a true science
of annotation" (Cohen et al. 2011: 82), but they are hardly relevant to the present discussion,
since they do not target features with reproducibility issues. In addition, they do not contribute
to refining the target of description, as a priori vigilance about reproducibility does.

LG also contributes to reproducibility indirectly, by supporting the publication
of results in readable formats. LG tables display readably which entries
have which features. Their well-known tabular format is theory-, framework-, 
formalism- and implementation-independent and allows for explicit negative in- 
formation, e.g. the fact that bear comparison to has no passive. Publishing LG 
tables in scientific publications and web sites indirectly tends to increase repro- 
ducibility, since peers can easily check if they agree with the recorded informa- 
tion. Kaalep & Muischnek (2008) adopt a readable tabular format too, but do 
not individuate columns for individual features. LG tables are used as source 
code, i.e. for manual edition; they usually need to be automatically translated into 
application-dependent formats (Tolone & Sagot 2011; Constant & Tolone 2010), 
and this is their main flaw for computational linguists (Hathout & Namer 1998; 
Gardent et al. 2005).?? However, other formats are less readable: the DuELME 
Grégoire (2010: 34-36) and Lefff (Tolone & Sagot 2011) formats contain lists of 
features without explicit negative information: to check that an entry does not 
have a given feature f, you have to verify that none of the features it has is f 
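
The contrast between the two kinds of format can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names lg_row, feature_list, lacks_feature_in_table and lacks_feature_in_list are hypothetical, the encodings are not the actual LG, DuELME or Lefff file formats, and only the passive value of bear comparison to comes from the text.

```python
# Hypothetical encodings of one entry; only the passive value comes from the text.

# Table-style row: every feature has its own column, so "-" is recorded explicitly.
lg_row = {"entry": "bear comparison to", "passive": "-"}

# List-style entry: only the features that the entry has are listed.
feature_list = {"entry": "bear comparison to", "features": []}


def lacks_feature_in_table(row, feature):
    """The table states the absence directly in the feature's own column."""
    return row[feature] == "-"


def lacks_feature_in_list(entry, feature):
    """Absence must be inferred: none of the listed features is the one sought."""
    return all(f != feature for f in entry["features"])


print(lacks_feature_in_table(lg_row, "passive"))
print(lacks_feature_in_list(feature_list, "passive"))
```

In this sketch, the list encoding cannot distinguish a feature that was examined and found absent from one that was never examined, whereas the table lookup fails loudly (with a KeyError) when a column is missing.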


4.3 Discussion 


Reproducibility is an epistemological requirement for scientificity. Low repro- 
ducibility casts a doubt on what exactly a feature is, since different observers 
perceive the feature differently. Gross (1981: 14) even says about traditional se- 
mantic classification of prepositional adjuncts: "Perception of such distinctions 
depends a lot on individuals; therefore, they might be of no interest" (my transla- 
tion). Features with high reproducibility of observation are essential when lexical 
entries are described manually, and provide a good basis for a classification with 
an ambition of stability and scientificity. In addition, many of these features are 
factual, and therefore informative for further research, no matter the linguistic 
theory adopted. Factual features are available for language processing, especially 
when they determine the possibility of occurrence of actual forms, as in (16), (17) 
and (23)-(31): this is essential to automatically recognizing such MWEs. 
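
Reproducibility of observation can be given a number by comparing the judgments of two judges on the same entries. The sketch below uses Cohen's kappa, a standard chance-corrected agreement measure that this chapter does not itself name; the function name cohen_kappa and both judgment lists are invented for illustration and do not report any actual experiment.

```python
from collections import Counter


def cohen_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two judges on the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty judgment lists of equal length")
    n = len(judge_a)
    # Proportion of items on which both judges give the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Agreement expected by chance, from each judge's label frequencies.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges always use the same single label
    return (observed - expected) / (1 - expected)


# Invented judgments ("+" = the entry has the feature) for ten entries.
judge_a = ["+", "+", "-", "-", "+", "-", "+", "+", "-", "-"]
judge_b = ["+", "+", "-", "+", "+", "-", "+", "-", "-", "-"]
print(round(cohen_kappa(judge_a, judge_b), 2))
```

With these invented lists the judges agree on 8 entries out of 10 and chance agreement is 0.5, so the script prints 0.6.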

LG authors' experience on a number of languages proves that the requirement 
of reproducibility does not drastically limit the diversity of features to be studied. 
Their tables of MWEs contain a large collection of useful features. For instance, 
the study of prepositional-phrase idioms compatible with be in English by Ma- 
chonis (1987) showed that many of them admit a syntactic operation that inserts 


²⁹ Tolone (2011) improved part of the LG in that regard: she homogenized the mnemonic identifiers
of properties, encoded properties common to whole classes and created a user documentation
in English.
verbs such as get or throw and a causative or agentive subject, as be in a jam ‘be
in trouble’:


(37) a. Kathy was in a jam.
b. An unfortunate situation had (got + thrown) Kathy into a jam.³⁰


But some idioms don't admit this operation with the same verbs, as be in the 
wrong ‘be morally or legally wrong’:


(38) a. The cyclist is in the wrong. 
b. Slapping the pedestrian (got + *threw) the cyclist in the wrong. 


Features related to this little-known causative construction, usually classified in 
recent literature under the large category of “lexical variation”, are decisive for 
the automatic parsing of sentences such as (37b). 
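
The parenthesized notation used in (37b) can be expanded mechanically into the variants it stands for. The following is a minimal sketch, assuming that alternatives are separated by + inside one pair of parentheses; the names GROUP and expand are hypothetical, and the acceptability star of (38b) is deliberately not handled.

```python
import re

# A parenthesized group whose alternatives are separated by "+", e.g. "(got + thrown)".
GROUP = re.compile(r"\(([^()]*\+[^()]*)\)")


def expand(pattern):
    """Return every variant denoted by a pattern containing (a + b) groups."""
    match = GROUP.search(pattern)
    if match is None:
        return [pattern]
    variants = []
    for alternative in match.group(1).split("+"):
        candidate = pattern[:match.start()] + alternative.strip() + pattern[match.end():]
        variants.extend(expand(candidate))  # expand any further groups
    return variants


for sentence in expand("An unfortunate situation had (got + thrown) Kathy into a jam."):
    print(sentence)
```

Because the parentheses delimit each variant exactly, the expansion needs no guess about where an alternative begins or ends; the script prints the two sentences with had got and had thrown.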


4.4 A more detailed example 


Semantic features such as analysability are rare in LG descriptions: they are difficult
to define with sufficient rigour. I will exemplify this difficulty with a new
semantic feature which is interesting for NLP, but requires a precise definition 
before it can be encoded in LG tables. 

The French law term citer un témoin (lit. ‘quote a witness’) ‘call somebody as a
witness’ is an MWE because the verb citer has this meaning only with the noun
témoin. Still, the meaning of this noun in the idiom is the same as a meaning this 
noun can also have, as a (lexicalized) law term, when citer is not present at all in 
the context. Even in the idiom, it usually refers to a specific person. It can belong 
to a chain of coreferring expressions, no matter whether it is the first element of 
the chain, as in (39), or not, as in (40). 


(39) La défense a cité un témoin. Il vient de s'exprimer. 
Lit. The defence quoted a witness. He has just expressed himself.
‘The defence called a witness. He has just spoken.’


In (39), un témoin ‘a witness’ and il ‘he’ refer to the same person. 


³⁰ The notation (got + thrown) serves to refer to several variants, here both had got Kathy into a
jam and had thrown Kathy into a jam. This notation inspired from algebra and commonly used 
in LG is more informative than the notation got/thrown, since the parentheses delimit precisely 
where each variant begins and ends. 
(40) Ils avaient un autre témoin, mais finalement ils ne l'ont pas cité.
Lit. They had another witness, but finally they did not quote him. 


‘They had another witness, but they ended up not calling him.’


In (40), un témoin in the idiom is replaced by the pronoun l’ ‘him’, which refers
to the same person as un autre témoin 'another witness’. In a chain of coreferring 
expressions like those of (39) and (40), the syntactic markers of the coreference 
such as determiners, pronouns, etc., follow the same rules as when the noun is 
not part of an idiom. For example, in (39), il ‘he’ has the same form as when un 
témoin ‘a witness’, but not the rest of the idiom, is present in the context: 


(41) La défense a un témoin. Il vient de s'exprimer.
‘The defence has a witness. He has just spoken.’


The feature that I wish to single out, and which témoin ‘witness’ in (39) shares 
with many other idiom components, is a combination of three properties: 


(i) The component, when used in the idiom, has mandatorily a meaning that 
it can also have (as a lexicalised meaning) when the rest of the idiom is not 
present at all, not even in the context, as opposed to feet in think on one's 
feet ‘improvise a reaction quickly’, or to eyes in roll one’s eyes, where eyes
doesn't always refer to eyes.³¹

(ii) The component can be the first in a chain of coreferring expressions, and 
then the syntactic markers of the coreference: determiners, pronouns, etc., 
follow the same rules as when the noun is not part of the idiom. This does 
not happen, for instance, with posture in the French idiom of (42): 


(42) Kathy était en mauvaise posture. 
Lit. Kathy was in bad posture. 


‘Kathy was in trouble.’


To refer to the trouble after they have used this idiom, speakers use another 
noun: 


³¹ Property (i) matches what Burger (2007: 96) calls partly idiomatic expressions. It is more restrictive
than analysability/decomposability: for instance, in pull strings ‘covertly use one's
influence on personal connections’, the noun string does not keep any of the lexicalized mean- 
ings it has when pull is not present at all. As a consequence, the feature that I am defining is 
different from analysability too. 
(43) Kathy était en mauvaise posture. Ces difficultés auraient pu être évitées.
Lit. Kathy was in bad posture. This trouble could have been avoided.
‘Kathy was in trouble. This trouble could have been avoided.’


Without this idiom, they can use the same noun:


(44) Kathy avait une posture fière. Cette posture a été commentée.
‘Kathy had a proud posture. This posture has been commented on.’


But if the first expression referring to the trouble is part of the idiom of 
(42), speakers do not use the same noun for other coreferring expressions: 


(45) * Kathy était en mauvaise posture. Cette posture aurait pu être évitée.
Lit. Kathy was in bad posture. This posture could have been avoided.
‘Kathy was in trouble. This trouble could have been avoided.’³²


(iii) The component can occur in a chain of coreferring expressions without
being the first, and then the syntactic markers of the coreference such as 
determiners, pronouns, etc., follow the same rules as when the noun is 
not part of the idiom. This does not happen, for example, with strings in 
pull strings ‘covertly use one's influence on personal connections’. When 
speakers refer to the connections before using this idiom, the coreference 
between the first mention of the connections and the idiom component is 
not explicitly marked: 


(46) I needed connections to make myself known, and John could pull 
strings for me. 


In this form, strings has the same form as if there were no mention of it 
before. Without the idiom, we observe a syntactic marker of coreference: 


(47) I needed connections to make myself known, and John provided them
to me.


³² Example (45) is not entirely parallel to (39): (39) involves pronouns and (45) involves determiners
and nouns. Studying feature (ii) requires taking into account diverse syntactic markers of
coreference. This feature is connected with pronominalizability, but is not limited to it.
Speakers do not use this marker if the second mention of the connections
is part of the idiom: 


(48) * I needed strings to make myself known, and John could pull them for
me. 


But why get interested in the combination of features (i)-(iii),³³ which has
never been studied or named? Because it is shared by many other terminological
idioms, for example the French term of geometry abaisser une perpendiculaire
à ‘drop a perpendicular on to’ (lit. ‘move down a perpendicular to’).³⁴ The idiom
component that has the feature, like témoin ‘witness’, is often a technical term too,
and is able to denote a referent in a clear and specific way. In such cases, these
idioms are meaningful elements of technical texts, a realistic target for future 
improvements to the automated understanding of natural language texts. 

My definition is based on properties (i)-(iii), which are relatively formal.³⁵
Even so, this feature is probably not ready for encoding, that is for production 
of a satisfactory list of the idioms with this feature: only large-coverage encod- 
ing experiments would tell if this definition ensures sufficient reproducibility of 
observation. 
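
The conjunction of properties (i)-(iii) can be written down explicitly. The following is a minimal sketch, not an LG encoding: the class Component and its field names are hypothetical, and None marks a property on which the text above makes no statement for that component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    """An idiom component with properties (i)-(iii); None means 'not stated'."""
    noun: str
    idiom: str
    keeps_autonomous_meaning: Optional[bool]  # property (i)
    can_start_chain: Optional[bool]           # property (ii)
    can_continue_chain: Optional[bool]        # property (iii)

    def has_feature(self) -> Optional[bool]:
        """Conjunction of (i)-(iii): False if any fails, None if undecided."""
        values = (self.keeps_autonomous_meaning,
                  self.can_start_chain,
                  self.can_continue_chain)
        if any(v is False for v in values):
            return False
        if any(v is None for v in values):
            return None
        return True


components = [
    Component("témoin", "citer un témoin", True, True, True),
    Component("posture", "être en mauvaise posture", None, False, None),
    Component("strings", "pull strings", False, None, False),
]
for c in components:
    print(c.noun, c.has_feature())
```

A single failed property is enough to exclude a component, so posture and strings come out as False even though not all three of their properties are documented here; only témoin comes out as True.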

In this section, all the examples above focus on nouns that are parts of verbal 
idioms. How does the feature extend to other PoS? Here are two noteworthy 
features closely related to this one. 

In a large proportion of multiword nouns, the head noun keeps all the gram- 
matical and semantic behaviour it has as an independently existing lexical entry. 


³³ It is not an “aggregate” of features (i)-(iii) as in §3 above: it is specifically a conjunction of these
independent features, in the sense that an idiom has it if and only if it has simultaneously (i), 
(ii) and (iii). Alternatively, features (i)-(iii) might be studied separately, but they are likely to 
be less useful than their conjunction. 

³⁴ Abaisser has this meaning only with perpendiculaire and parallèle ‘parallel’, which keep their
autonomous terminological meaning and ability to be referred to anaphorically: 


(i) On abaisse une perpendiculaire de A à BC. Cette droite est parallèle à CD.
‘A perpendicular is dropped from A on to BC. This line is parallel to CD.’


³⁵ My definition does not include another striking property of témoin ‘witness’ in (39): it can
refer to a specific entity, as opposed to fire in (29a) John fights fire with fire, which alludes to 
ways of fighting in general. The semantic distinction between specific and generic reference is 
a matter of pure intuition. As such, its reproducibility of observation can be low, for example 
in the case of bread in They will take the bread from our mouths ‘They will divert money from
us’: in this sentence, does the bread refer to material goods in general, or to a specific instance 
of income? 
This is the case of red wine: it is a (terminological) MWE because red has this
meaning only with wine, but wine can be equated with the independent lexical 
entry wine, with the same properties (i)-(iii) as above. With multiword nouns, 
this is related to another test: red can usually be inserted in sentences with wine 
or removed from them, without unexpected changes in acceptability or meaning: 


(49) a. They have an interest in wine.
b. They have an interest in red wine.


(50) a. Is red wine healthy and worth the calories?
b. Is wine healthy and worth the calories?


This test uses differential semantic assessment. Smooth operator ‘persuasive person;
manipulative person’ doesn't share this feature, as the meaning changes in
(51)-(52) show: 


(51) a. Ask the operator to dial.
‘Ask the switchboard operator to dial.’
b. Ask the smooth operator to dial.
‘Ask the persuasive person to dial’; ‘Ask the manipulative person to
dial.’
(52) a. Any lady I’ve dated will tell you I'm no smooth operator. 
‘Any lady I've dated will tell you I'm not manipulative’ 
b. Any lady I’ve dated will tell you I’m no operator. 
‘Any lady I’ve dated will tell you I’m not a switchboard operator’ 


Some adverbs are specific to one or a few verbs, which nevertheless keep all 
their behaviour. Here are two examples in French from the tables of multiword 
adverbs by Gross (1986): 


(53) a. (chanter + crier + rire) à gorge déployée
Lit. (sing + shout + laugh) at opened-out throat
‘(sing + shout + laugh) out loud’


b. N₀ aller à N₁ comme un tablier à une vache
Lit. N₀ fit N₁ as an apron does a cow
‘N₀ fit N₁ supremely badly’


As opposed to the preceding examples, these have no terminological value. 
Care for reproducibility in observation of linguistic facts characterizes a conception
of humanities in which scholars not only share insights and deepen their
intuition, but also gather reliable factual knowledge, paying attention to practi- 
cal techniques that improve the quality of their description. In this conception, 
descriptive work, and in particular lexical description, is fundamental. For in- 
stance, assessing reproducibility of observation is a practical matter: it involves 
scanning through the lexicon while trying to describe which entries have a fea- 
ture and which don't. Thus, preferring features that can be observed by humans 
in a reproducible way is good practice. 


5 Checking information against the lexicon 


5.1 Discussion 


Checking information against the lexicon is still alien to a large part of MWE 
research. Baldwin & Kim (2010) do not cite studies using intensive lexical de- 
scription of MWEs, except for Estonian MWEs. The companion website to their 
paper cites corpora and tools, but no NLP dictionaries.³⁶ Among the tools, it even
omits those based on dictionaries. Current descriptive research is not eager to
achieve large lexical coverage. FrameNet has a small coverage of MWEs (Hartmann
& Gurevych 2013), and so do, among NLP dictionaries, VerbNet, WordNet
and Meaning-Text. Bond et al. (2015) encode for HPSG a sample of English idioms
with a possessive coreferent with the subject, like roll one's eyes, but here is how
the size of the result compares with previous efforts: Freckleton's (1985) classes
C1A and C11 contain 538 verbal idioms with such a possessive, while the four classes
of Bond et al. (2015: 64) that correspond to C1A and C11 total 168. Aside
from LGs, sizable lexical databases of MWEs are few. The NomLex-Plus and NomBank
dictionaries of English nouns with predicate-argument structure list 8000
entries (Meyers 2007). The database of Kaalep & Muischnek (2008) lists 13000 Estonian
MWEs. The DuELME dictionary of Dutch MWEs totals 5000 expressions
(Grégoire 2010).³⁷

My point in this part is that information on MWEs is worth checking against
the lexicon. Reluctance against lexical description is rarely explicit, and when it
is, it is not motivated by sound reasons.³⁸


³⁶ http://handbookofnlp.cse.unsw.edu.au/?n=Chapter12 was looked up in August 2016.

³⁷ The SemLex Dictionary of Czech MWEs is still little documented in publications (Bejček &
Straňák 2010).

³⁸ “Again, in itself this type of approach [interviews, surveys, statistics] is neither good nor bad.
The question is whether it leads to the discovery of principles that are significant. We are back 
Sure, intensive investigation into the lexicon is costly. For example, the construction
of LG tables of MWEs, which are comprehensive repositories with representation
of individual features (cf. §2.2.1), has always involved considerable
work. But the objective of a satisfactory processing of MWEs is worth the cost and
effort. (The reason one enjoys Dostoyevsky is not because he is easy to read.) 
And the tables are available for several languages, which shows that this work is 
realistic. 

The reluctance towards intensive lexical description might come from a feel- 
ing that it is deemed an unskilled, low-grade occupation. But such a feeling is 
unfounded: in projects of construction of large lexical databases of MWEs, lin- 
guists are obviously engaged in highly skilled labour. 

The reluctance may be directed towards manual work. As computer science 
is about automating information processing, many computational linguists may 
understandably feel excited about devising "knowledge-free" solutions that avoid 
the need of any labour-intensive activity, be it in preliminary operations. But, in 
the case of MWE-related NLP, relying on this only hope is adventurous: the goal 
of fully automating acquisition of knowledge about all MWEs has been giving 
hard times to the community for more than 15 years. 

No dictionary is 100% complete or 100% error-free, but this does not make them 
useless. 

And manual lexical description has several advantages. The resulting data al- 
lows for more well-documented studies and is likely to be useful for making 
successful rules or devising successful machine learning experiments. When linguists
scrutinize 10 features on a comprehensive part of a 1000-item class, what
they find out is worth a look. It provides examples and counter-examples
which are useful to test predictions, proposals and hypothetical rules or gener- 
alities. LG tables, as large repositories of factual features, are a source of exam- 
ples for further research, no matter the theory, framework or implementation 
to be used. Creativity of language is a major obstacle to its scientific study, and
it lies, among other things, in the combinatorics of lexical items and grammatical
constructions: systematic investigation in the lexicon is therefore a way of
addressing this problem.


to the difference between natural history and natural science. In natural history, whatever you
do is fine. If you like to collect stones, you can classify them according to their color, their shape,
and so forth. Everything is of equal value, because you are not looking for principles. You are
amusing yourself, and nobody can object to that. But in the natural sciences, it is altogether
different. There the search is for the discovery of intelligible structure and for explanatory
principles. In the natural sciences, the facts have no interest in themselves, but only to the
degree to which they have bearing on explanatory principles or on hidden structures that have
some intellectual interest” (Chomsky 1979: 58-59). Beyond the depreciative rhetoric of this
passage, Chomsky actually suggests skipping factual observation when it involves extensive
description.

Intensive lexical description is crucial to selecting features for classification, 
and therefore to the quality of classification. The construction of the NomLex- 
Plus and NomBank dictionaries of English nouns with predicate-argument struc- 
ture involved an unprecedented investigation into support-verb constructions in 
English and into features to recognize and classify them (Meyers 2007). The study 
of idioms that take the form of a prepositional phrase in Romance languages 
(Danlos 1980; Ranchhod 1990; Gross 1996; Vietri 1996), English (Machonis 1988) 
and Greek (Moustaki 1995) singled out a particularly useful feature for the top of 
MWE classifications. Some of these idioms are compatible with be or the equivalent
copula in other languages and may appear in predicative position, as in (54b):


(54) a. John will reach the end on time. 
b. John will be on time. 


Others may not: 


(55) a. The crisis has a demographic cause in the final analysis.
‘The crisis has a demographic cause, when everything has been considered.’
b. * The cause is in the final analysis.


Those compatible with be, like on time 'punctually; punctual' in (54), on vacation
and on the spot ‘immediately; in the same place; in trouble’, usually pose a problem
of PoS: are they closer to adjectives or to adverbs?³⁹ In contrast, those like in
the final analysis in (55a) and for instance are clearly adverbial expressions.
Compatibility with be, that is the contrast between (54) and (55), provides a relatively
sharp division in a large number of cases where PoS distinction would otherwise
be particularly uncertain. Applying this criterion requires investigating the
syntactic contexts of idioms in sentences, but this is the usual price to be paid
to resolve PoS issues,⁴⁰ and PoS are key to a general classification of MWEs. So,
this criterion is more relevant than the presence vs. absence of a determiner,
retained by Baldwin & Kim (2010: 278) at the top of their classification, a criterion
that only uses the internal structure of idioms.


³⁹ Lexicalized MWEs are lexical items, so they may have a PoS like single-word lexical items do.

⁴⁰ The most appropriate definition of each PoS is based on its possible syntactic contexts in sentences.
For example, in English, a noun is to be recognized by its ability to be preceded by
determiners and adjuncts, followed by adjuncts, etc. 
Lexical data deepens knowledge of how correlated two features are. It does
so by providing reliable statistics on lexical entries: how many entries with fea- 
ture f also have feature g? For example, the causative construction of (37b), with 
a prepositional-phrase idiom and get, throw or other verbs like keep, is observed 
only when the idiom is compatible with be: 


(56) a. John will be on time. 
b. This gift will keep John on time. 


(57) a. * The cause is in the final analysis.
b. * This point keeps the cause in the final analysis.


Such specific grammatical information allows for measuring correlations accu- 
rately. 
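
The question "how many entries with feature f also have feature g?" amounts to a simple count over a table of entries. The sketch below uses a toy table, not LG data: the function name cooccurrence and the feature keys are hypothetical, the values follow examples (37), (38), (56) and (57), and the row for for instance is an assumption.

```python
# Toy table: "be" = compatible with be; "causative" = admits the causative
# construction with get, throw or keep. Values follow examples (37), (38),
# (56) and (57); the row for "for instance" is an assumption.
table = {
    "in a jam":              {"be": True,  "causative": True},
    "in the wrong":          {"be": True,  "causative": True},
    "on time":               {"be": True,  "causative": True},
    "in the final analysis": {"be": False, "causative": False},
    "for instance":          {"be": False, "causative": False},
}


def cooccurrence(table, f, g):
    """Count the entries with feature f, and those among them that also have g."""
    with_f = [entry for entry, feats in table.items() if feats[f]]
    with_both = [entry for entry in with_f if table[entry][g]]
    return len(with_f), len(with_both)


n_causative, n_both = cooccurrence(table, "causative", "be")
print(f"{n_both} of {n_causative} causative entries are compatible with be")
```

On a comprehensive table rather than this toy one, the same two counts would give the proportion of causative entries that are compatible with be.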

Intensive lexical description tends to make researchers more cognizant of vari- 
ation, including less frequent variations and variations of less frequent items. As 
such, it is complementary to corpus annotation, which rather makes them aware 
of context-related issues. 

Lexical description also provides means of separating homonymous entries, 
for example the various interpretations of on the spot: ‘immediately’; ‘in the same 
place’; ‘in trouble’. Such separation, in turn, is essential to construct cross-lingual 
tables (Ranchhod & De Gioia 1996). 

All these benefits of lexical description make it a priori useful for applications. 
There is still little significant feedback from the NLP use of any comprehensive
dictionary of MWEs, but this may come from the complexity of the problem and 
the interdependence of all subproblems of symbolic syntactic parsing. 


5.2 Predicted vs. checked features 


Gibbs & Nayak (1989: 104) hypothesize that semantic analysability/decomposabil- 
ity "determines the syntactic behavior of idioms". In this section, I examine the 
present and potential consequences of this conjecture. 

With Nunberg et al. (1994), the hypothesis becomes two claims. First, the analysability
of an expression predicts that syntactic operations are applicable to it: “the
syntactic properties of idioms [that is the applicability of syntactic operations]
are largely predictable from the semantically based analysis of idioms we are
proposing [i.e. their analysability]” (Nunberg et al. 1994: 507). In parallel, the
unanalysability of an expression predicts that syntactic operations are not applicable
to it: “we (...) explain a variety of ‘transformational deficiencies’ of idioms by
positing a bifurcation between [unanalysable] and [analysable] expressions, with
only the latter type permitting those processes” (Nunberg et al. 1994: 508).

After these claims, analysability became popular in the community and was 
used to define some of the major classes of MWEs. The two predictions give a 
sense that analysability is an underlying, fundamental property, and that its use 
in classification implements a strategy of parsimony, since assigning an entry to 
a class automatically specifies all the predicted features. Sag et al. (2002: 4) retain 
only the second prediction: “due to their opaque semantics, non-decomposable
idioms are not subject to syntactic variability, e.g. in the form of internal modification
(#kick the great bucket in the sky) or passivization (*the breeze was shot)”.⁴¹

However, Nunberg et al. (1994) do not check the claims, either on available data 
or on original data. A general claim requires systematic verifications, which they 
mention as a perspective for future work: "testing this prediction systematically 
is a nontrivial project" (Nunberg et al. 1994: 531). Therefore, both claims remain 
hypotheses. 

When authors check the predicted syntactic features, they readily find
counter-examples to both predictions (Abeillé 1995; Stathi 2007). Here are three
more that I picked from the lists of French verbal idioms by Gross (1982): 


(58) rater un éléphant dans un couloir 
Lit. miss an elephant in a corridor 


‘be unable to hit the broad side of a barn; have poor aim; be unable to
reach targets’


Example (58) seems analysable as miss(x, easy-target), but does not admit syntac- 
tic variations, not even omission of the prepositional complement. 


(59) trouver chaussure à son pied 
Lit. find shoe to one's foot 
‘find the perfect match for oneself’


Example (59) seems analysable as something like find(x, partner), but does not ad- 
mit syntactic variations either. Conversely, (60) is hardly semantically analysable: 


(60) mettre toutes les chances de son côté
Lit. put all the chances on one's side 


‘not take any chances’


⁴¹ Shoot the breeze means ‘talk casually’.
But it admits the passive form:


(61) Toutes les chances sont mises de votre côté.
Lit. all the chances are put on your side
‘You are not taking any chances.’


Nunberg et al. (1994: 512) extend their claims in the case when an idiom is 
analysable: "the syntactic versatility of an idiom is a function of how the mean- 
ings of its parts are related to one another and to their literal meanings". In other 
words, details of the semantic structure of analysable idioms would predict which 
syntactic operations are applicable. 

It is particularly difficult to give credit to this hypothesis. Its authors do not 
check it any more than the previous one; the alleged rules of prediction are un- 
known. Formalizing them would be a challenge that no one has taken up since. 
Instead, Riehemann (2001) finds that which types of syntactic variation a given 
idiom can undergo is highly unpredictable. 

Baldwin & Kim (2010: 280) adopt Nunberg, Sag & Wasow's 1994 hypothesis as 
their own: "the exact form of syntactic variation [of verbal idioms] is predicted 
by the nature of their semantic decomposability". But they do not provide any
evidence to support it. Even worse, their formulation suggests that, instead of 
describing the syntactic variation of verbal idioms, one might infer it automati- 
cally from a description of their analysability. But recall that syntactic variation 
is more reproducibly observable than analysability: thus, the suggested proposal 
comes down to inferring several factual features from a property that poses problems
of definition and observation (cf. §3.2). Such a process would hardly be effectual.

Predicting features might seem a clever move. But it necessarily begins as a 
hypothesis, which needs to be checked to get any scientific value. So, predicting 
features does not allow for bypassing the verification step. 


6 An adapted classification 


I propose in Figure 1 a decision tree adapted from Baldwin & Kim (2010: 279), 
but which avoids the flaws discussed above, and in particular features that are 
too fuzzy or difficult to observe. It uses all the MWE-related LG work, including 
the studies on English, Romance languages, Greek, Korean and other languages, 
cited in §2.2.1, §2.2.3, §3.2 and §5.1. Much of this work was conducted in parallel
and cross-linguistic comparisons showed that, even though the details of formal 
criteria depend on the languages (§2.3), the notions they define are similar. For
example, the typology of French support verbs by Gross (1998) is transferred to 
the English FrameNet by Ruppenhofer et al. (2006: 37-38) without any modifica- 
tion. The classification in Figure 1 is in terms of notions defined by criteria men- 
tioned in the text of this chapter, but the criteria are not repeated in the figure. 
Thus, it is formulated for English and easily adaptable to many other languages: 
to adapt it to French, substitute étre for be. 


MWE
    lexicalized
        MWE without support verb
            verbal idiom: take stock
            multiword noun: traffic lights
            multiword adverb: for instance
            prepositional phrase compatible with be: on time
            multiword adjective: black and white
        construction with support verb and noun: have an aim
    non-lexicalized: salt and pepper


Figure 1: Classification of MWEs. 
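
The three decisions of Figure 1 can be sketched as a small function. This is only an illustration: the function name classify, its parameters and the category keys are hypothetical, and the three judgments it takes as input (lexicalization, support-verb construction, PoS) still have to be made by a linguist applying the criteria mentioned in the text.

```python
def classify(lexicalized, support_verb_construction=False, category=None):
    """Walk the three decisions of Figure 1 and return a class label."""
    # First decision: lexicalized vs. non-lexicalized.
    if not lexicalized:
        return "non-lexicalized MWE"
    # Second decision: support-verb construction or not.
    if support_verb_construction:
        return "construction with support verb and noun"
    # Third decision: PoS, with prepositional phrases compatible with be apart.
    leaves = {
        "verb": "verbal idiom",
        "noun": "multiword noun",
        "adverb": "multiword adverb",
        "pp_with_be": "prepositional phrase compatible with be",
        "adjective": "multiword adjective",
    }
    if category not in leaves:
        raise ValueError(f"unknown category: {category!r}")
    return leaves[category]


# The examples given in Figure 1.
print(classify(False))                          # salt and pepper
print(classify(True, True))                     # have an aim
print(classify(True, False, "verb"))            # take stock
print(classify(True, False, "pp_with_be"))      # on time
```

The order of the tests mirrors the order of the levels in the figure, so swapping the second and third decisions would only require reordering the corresponding blocks.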


The top distinction is between lexicalized and non-lexicalized expressions. By 
non-lexicalized expressions, I mean those that are fully compositional but in 
which a statistical preference for an element is not explained by extra-linguistic 
facts. For example, the preference for sell a house over sell a wall is explained by 
cultural habits, so we don't need to describe it as a linguistic property; therefore, 
sell a house is not an MWE. In contrast, the preference for the French phrases 
tondre la pelouse ‘mow the lawn’ and couper l'herbe ‘cut the grass’ over tondre 
l'herbe ‘mow the grass’ and couper la pelouse ‘cut the lawn’ is a purely linguistic 
fact. This suggests tondre la pelouse and couper l'herbe are MWEs, but they are not 
lexicalized, since the other two are in use. The term black and white 'composed of 
shades of black or of a single colour' is lexicalized. If it were fully compositional, 
speakers would be able to interpret white and black the same way as black and 
white, which they aren't. The same holds for traffic lights: if it were fully compositional,
speakers would be able to interpret it as another type of light connected
with traffic. 

The second distinction in Figure 1 relies on the notion of support-verb con- 
struction. This is not an easy distinction, but the literature shows that trained 
linguists are able to make it on the basis of formal criteria that ensure sufficient 
reproducibility of observation: these criteria are outlined in §2.2.3. Support-verb
constructions are a significant class because they are numerous both in texts and 
in a dictionary: out of the 62,100 MWE entries of the French LG, 12,700 (20%) 
are support-verb constructions (Tolone 2011). Support-verb constructions have 
in common a crucial property which is a good reason to place them so close to 
the top of the classification: the construction with the verb, for example have a 
passion, and the construction without the verb, which is usually a predicational 
noun, here passion, are not adequately described by two distinct lexical entries. 
For example, the arguments of have a passion are exactly the same as those of 
passion, including the preposition of the complement (for) and the restrictions 
on what may fill both slots (the subject contains a human noun; the complement 
may contain a human, concrete or abstract noun or an infinitival clause). More- 
over, occurrences without the support verb are usually more frequent in texts 
than occurrences with it (Laporte et al. 2008). 

The third distinction in Figure 1 is based on PoS.* It conflates adverbs with 
prepositions and conjunctions, with the view that a multiword preposition like 
in spite of or a multiword conjunction like in case that may also be analysed as 
a multiword adverb with a free prepositional or clausal slot. This lowest level 
of the tree includes an additional “PoS”, namely “prepositional phrase compati- 
ble with be", such as on time and on vacation, in order to sort out partially the 
problem of assigning a PoS to idioms taking the form of prepositional phrases: 
are they adverbs or adjectives? The compatibility with be provides a relatively 
sharp division in a large number of cases where PoS distinction is particularly 
uncertain (cf. (54)-(55), §5.1). The multiword adjective category is meant for 
expressions that do not take the form of prepositional phrases, for example black 
and white ‘composed of shades of black or of a single colour’ or safe and sound 
‘unharmed’. In Figure 1, the distinction between support-verb constructions or 
not is just above the decision about PoS. It could also be the other way round, 
which would make the support-verb-construction class a sibling of the 
verbal-idiom class. This variant would give prominence to PoS, which are always key 
information, well-known classes and often clear-cut features. 

* The appropriate definition of each PoS in this context is based on its possible syntactic contexts 
in sentences (cf. §5.1, footnote 40). 

Here, free means that the content of the slot, that is the noun phrase or the embedded clause, 
is variable. 
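
The three-level decision tree described for Figure 1 can be sketched in code. This is only an illustrative sketch: the `Expression` record, its field names and the label strings are assumptions introduced here, and the three features are assumed to have been assigned by a trained linguist beforehand, since the text gives no automatic procedure for them.

```python
from dataclasses import dataclass

# Leaf labels for the PoS level, as named in the text.
POS_CLASSES = {
    "verb": "verbal idiom",
    "noun": "multiword noun",
    "adverb": "multiword adverb",  # also covers multiword prepositions and conjunctions
    "adjective": "multiword adjective",
    "pp_with_be": "prepositional phrase compatible with be",
}


@dataclass
class Expression:
    """Hypothetical record: the three features are assumed to be hand-annotated."""
    text: str
    lexicalized: bool
    support_verb: bool
    pos: str  # one of the keys of POS_CLASSES


def classify_figure1(expr: Expression) -> str:
    """Walk the three levels of the decision tree from top to bottom."""
    if not expr.lexicalized:
        return "non-lexicalized MWE"
    if expr.support_verb:
        return "support-verb construction"
    return POS_CLASSES[expr.pos]


examples = [
    Expression("tondre la pelouse", lexicalized=False, support_verb=False, pos="verb"),
    Expression("have a passion", lexicalized=True, support_verb=True, pos="verb"),
    Expression("traffic lights", lexicalized=True, support_verb=False, pos="noun"),
    Expression("in spite of", lexicalized=True, support_verb=False, pos="adverb"),
    Expression("safe and sound", lexicalized=True, support_verb=False, pos="adjective"),
    Expression("on time", lexicalized=True, support_verb=False, pos="pp_with_be"),
]
for e in examples:
    print(f"{e.text}: {classify_figure1(e)}")
```

Swapping the second and third tests in `classify_figure1` would give the variant mentioned above, in which the support-verb-construction class sits next to the verbal-idiom class.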

In Figure 2, I propose an alternative classification that may bewilder many re- 
searchers. But those that take seriously the notion of support-verb construction 
will probably find it more consistent than Figure 1. Since Gross (1981: 34), 
Ranchhod (1983) and Cattell (1984), the notion of support-verb construction has included 
constructions like be angry or get loose, where the support verb is be or one of 
its variants (Meyers 2007: 123). With this view, phrases like be angry or be a 
genius become support-verb constructions and therefore MWEs. Another conse- 
quence is that, in a support-verb construction, the core of the predicate may be 
an adjective or even a prepositional phrase (e.g. be on time) instead of a noun. Few 
computational linguists are familiar with these two ideas. But analysing all these 
expressions as support-verb constructions is consistent. They undergo semantic 
and syntactic phenomena observed with other support verb constructions: 


(i) Syntactic operations produce constructs where the core of the predicate 
occurs without the verb, with the same meaning. For example, in the same 
way as have disappears in the alternation between the habit the customer 
had and the customer's habit, the verb be also disappears between a 
customer who was angry and an angry customer. 


(ii) Other verbs can replace be, causing an aspectual or stylistic effect: compare 
The customer was angry with The customer got angry. This pair is parallel 
to The customer had a habit / The customer gained a habit. 


(iii) There exist constructs with an additional causative or agentive subject and 
another verb, as in (37) or in The team was confident / Football made the team 
confident. Such pairs are parallel to The team had a goal / Football gave the 
team a goal. 


Figure 2 adopts this view and considers the copula (a linguistic term for be or 
its equivalent introducing a predicate) as a part of a support-verb construction. 
Prepositional phrases compatible with be shift to support-verb constructions. 
Since more expressions are considered MWEs than in Figure 1 and support-verb 
constructions become more diverse, they are divided in subclasses too, taking 
into account the PoS of the core of the predicate.** The core of the predicate may 


** The exact list of PoS under support-verb constructions and non-support-verb constructions 
depends on languages: in Arabic, Chinese or Korean, among others, predicational adjectives are 
used without a copula, and the class of copulative constructions with a predicational adjective 
is irrelevant for them. 

MWE 
    lexicalized 
        MWE without support verb 
            verbal idiom: take stock 
            multiword noun: traffic lights 
            multiword adverb: for instance 
        support-verb construction 
            Vsup is not copula 
                with predicational noun: have an aim 
            Vsup is copula 
                with predicational noun: be a genius 
                with predicational adjective: be angry 
                with prepositional phrase: be on time 
    non-lexicalized: salt and pepper 


Figure 2: Classification of MWEs where copula is considered a support 
verb. 


be either a word (have an aim, be a genius, be angry) or multiword (have a point 
of view, be a smooth operator, be safe and sound, be on time). There are two new 
categories: copulative constructions with a predicational adjective, for example 
be angry, be safe and sound,*** and with a predicational noun, for example be a 
genius, be a smooth operator. 

In Figure 2, the distinction between support-verb constructions or not is just 
above the decision about PoS. In addition, the PoS-based classification of 
support-verb constructions takes into account the PoS of the core of the predicate: noun 
(aim, point of view, genius, smooth operator), adjective (angry, safe and sound) or 
prepositional phrase (on time). This is an essential element of the diversity of 
these constructions. We can shift the PoS level of the decision tree above the 
support-verb-construction level, but then the tree will classify all support-verb 
constructions as verbal MWEs, and it will be desirable to add a second PoS level 
below, to take into account the PoS of the core of the predicate. Thus, the decision 
tree will have one more level than Figure 2 to define the same classes. 

*** Non-predicational adjectives are those compulsorily attributive, for example prime in This is 
John's prime role: *This is John's role that is prime. Another sense of prime corresponds to a 
predicational adjective: 21 is not prime. 
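
The alternative classification of Figure 2 can be sketched in the same spirit. The function name, its parameters and the label strings below are assumptions made for illustration; the leaves and the example expressions are the ones shown in Figure 2, and the feature values are again assumed to be hand-annotated.

```python
def classify_figure2(lexicalized, support_verb, copula=False, core_pos="noun"):
    """Return the Figure 2 leaf for one expression (features assumed hand-annotated)."""
    if not lexicalized:
        return "non-lexicalized"
    if not support_verb:
        leaves = {
            "verb": "verbal idiom",
            "noun": "multiword noun",
            "adverb": "multiword adverb",
        }
        return leaves[core_pos]
    if not copula:
        # Figure 2 shows a single leaf here: the core of the predicate is a noun.
        if core_pos != "noun":
            raise ValueError("Figure 2 lists only predicational nouns under a non-copula Vsup")
        return "Vsup is not copula, with predicational noun"
    leaves = {
        "noun": "Vsup is copula, with predicational noun",
        "adjective": "Vsup is copula, with predicational adjective",
        "pp": "Vsup is copula, with prepositional phrase",
    }
    return leaves[core_pos]


# The example expressions given at the leaves of Figure 2.
print("take stock:", classify_figure2(True, False, core_pos="verb"))
print("have an aim:", classify_figure2(True, True, copula=False, core_pos="noun"))
print("be a genius:", classify_figure2(True, True, copula=True, core_pos="noun"))
print("be angry:", classify_figure2(True, True, copula=True, core_pos="adjective"))
print("be on time:", classify_figure2(True, True, copula=True, core_pos="pp"))
```

Here the PoS of the core of the predicate is consulted only after the support-verb decision, which is the ordering the text argues keeps the tree one level shallower.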


7 Conclusion 


Something is to be learned from the experience of the last 20 years in respect of 
choosing features for classifying MWEs. Current practice routinely uses fuzzy 
features, or features defined in an imprecise way. On many occasions, a cluster of 
loosely correlated features is considered as a single feature. The choice of features 
with such flaws is likely to lead to classifications less fruitful for computational 
use. For example, describing the analysability/decomposability of verbal idioms 
is much less feasible and useful than describing their syntactic variation. 

Selecting more appropriate features is not an easy task. It requires prioritizing 
good practices when studying MWEs. One of them consists in systematically 
assessing the reproducibility of observation of each feature, in order to obtain 
reliable repositories of lexical data. Another good practice is to check facts and 
predictions against the lexicon. It is understandable that some researchers try to 
avoid the patient examination of thousands of lexical entries for dozens of 
individual features, in the hope of reaching the same results through other means. But it 
turns out that the laborious descriptive work they wish to elude is required not 
only to check hypotheses, but also to come across valid hypotheses: researchers 
who ignored large-coverage data constructed unverifiable hypotheses that won 
the attention of the community and resulted in loss of time. 

The LG approach implements these good practices in descriptive and analytical 
work. On the basis of the results of such work carried out on several languages 
in parallel, I outlined an enhanced classification of MWEs. 


Acknowledgements 


Thanks to the anonymous reviewers, to the scientific editors and to Alexis Neme, 
for their comments, suggestions and questions that considerably enriched this 
paper. Only I am responsible for its content. 


6 Choosing features for classifying multiword expressions 


Abbreviations 

Free subject verb multiword expressions (MWEs) of Modern Greek and English 
provide data that challenge the theoretical status of the syntactic notion OBJECT. We 
compare the syntactic reflexes of three types of verbal complement: objects of 
typical monotransitive verbs, indirect objects of ditransitive verbs and fixed accusative 
noun phrases (NPs) that occur as direct complements of verbs in MWEs. 
Passivisation, clitic replacement, object optionality and distribution present themselves 
as syntactic reflexes that draw relatively clear-cut lines across these three classes 
of verbal complements and suggest that the Grammatical Functions OBJ(ect) and 
OBJ(ect)θ of LFG should not be assigned to the fixed accusative NPs that occur in 
verb MWEs; rather a new Grammatical Function should be defined for this 
purpose. 


1 OBJ and OBJθ 


1.1 OBJ and OBJθ in Modern Greek and English 


It is widely claimed that the grammatical behavior of MWEs can be captured with 
the same machinery that is used for compositional structures (Gross 1998a,b; Kay 
& Sag 2014; Bargmann & Sailer 2018 [this volume]). We will present evidence 
from Modern Greek and English that possibly challenges this claim at the level 
of Grammatical Functions (GFs), more particularly the notion of syntactic 
object. GFs are primitive concepts for Lexical Functional Grammar (LFG), which is 
the theoretical framework of our discussion. Other linguistic theories, such as 
transformational grammar (Baker 2001) and HPSG (Pollard & Sag 1994) use GFs 
implicitly through appropriate structural interpretations. 


Stella Markantonatou & Niki Samaridi. 2018. Revisiting the grammatical function 
“object” (OBJ and OBJθ). In Manfred Sailer & Stella Markantonatou (eds.), Multiword 
expressions: Insights from a multi-lingual perspective, 187-213. Berlin: Language Science 

LFG distinguishes between two objects, the OBJ and the OBJθ (Bresnan & 
Moshi 1990; Dalrymple 2001). OBJ combines with prototypically transitive verbs. 
According to existing wisdom on syntax and semantics, the NP τον κώδικα των 
Ναζί (ton koðika ton Nazi) ‘the Nazi code’ (1) is the object of the transitive verb: 
it is marked with the accusative case while the semantics of the eventuality of 
code breaking assigns it the Proto-Patient role (Dowty 1990). 


(1) Τιούρινγκ: ο κρυπτογράφος που έσπασε τον κώδικα των Ναζί. 
    Turing: o kriptoγrafos pu espase ton koðika ton Nazi. 
    Turing: the cryptographer who broke the code.ACC the Nazi 

    ‘Turing: the cryptographer who broke the Nazi code.’ 


OBJθ (Bresnan & Moshi 1990) always co-occurs with an OBJ in the 
environment of an active predicate. Its distribution is restricted to the so-called 
ditransitive verbs. In (2) the NP a book instantiates the OBJθ GF and the NP Sue 
instantiates the OBJ. The NP Sue becomes the subject of the passivised verb in (3). 


(2) Helen gave Sue a book. 


(3) Sue was given a book. 


Modern Greek has a relatively small number of ditransitive verbs, such as the 
verb διδάσκω (ðiðasko) ‘teach’ (4)-(7), that subcategorise for OBJθ (Kordoni 2004). 
Examples (5)-(7) show that Modern Greek passive ditransitive verbs pattern with 
standard English passive verbs (3): the NP ιστορία (istoria) ‘history’ that 
instantiates the OBJθ does not become the subject of the passive form of the verb (6). 


(4) a. Ο Πέτρος διδάσκει στη Μαρία ιστορία. 
       O Petros ðiðaski sti Maria istoria. 
       the Petros teaches to.the Maria history.ACC 
       ‘Petros teaches history to Maria.’ 
    b. Ο Πέτρος διδάσκει τη Μαρία ιστορία. 
       O Petros ðiðaski ti Maria istoria. 
       the Petros teaches the Maria.ACC history.ACC 
       ‘Petros teaches Maria history.’ 


(5) Η Μαρία διδάσκεται ιστορία από τον Πέτρο. 
    I Maria ðiðaskiete istoria apo ton Petro. 
    the Maria is.taught history.ACC by the Petros 
    ‘Mary is taught history by Petros.’ 


(6) * Ιστορία διδάσκεται τη Μαρία από τον Πέτρο. 
    Istoria ðiðaskiete ti Maria apo ton Petro. 
    history is.taught the Maria.ACC by the Petros 


(7) Ιστορία διδάσκεται στη Μαρία από τον Πέτρο. 
    Istoria ðiðaskiete sti Maria apo ton Petro. 
    history is.taught to.the Maria by the Petros 
    ‘History is taught to Mary by Petros.’ 


But are OBJ and OBJθ, which have been modeled on compositional data, enough 
to capture MWE behavior? This is how the original question, namely whether 
"compositional" syntax is appropriate for MWEs, may be couched in an LFG 
framework. The discussion in the remainder of this paper is structured as follows: 
in the second part of §1 we present the diagnostics for distinguishing between 
the two types of object that are available in LFG, namely the OBJ and the OBJθ. 
In §2 we apply the classical constituency diagnostics to MWEs in order to 
identify the constituents that will instantiate the GFs. In §3, we apply the objecthood 
diagnostics to the constituents identified within MWEs and compare the results 
with the ones obtained from the application of the same diagnostics to 
compositional structures. Passives are discussed in §4. In §5 we discuss the results of the 
application of objecthood diagnostics to MWEs and the pros and cons of four 
different answers to our original question, and argue in favor of the adoption of a 
new GF, which we call FIX. Finally, in §6 we show that a variety of MWEs can be 
modeled with FIX. We conclude with a set of questions open to future research. 


1.2 Diagnostics for distinguishing between OBJ and OBJθ 


Hudson (1992) has discussed the following 11 diagnostics for distinguishing 
between English direct and indirect objects, OBJ and OBJθ respectively in LFG 
terms: passivisation, extraction, placement after a particle, participation in 
heavy-NP shift, accusative case in a true case system, lexical subcategorisation, bearing 
the same semantic role as the prototypical direct object, animacy, existence of 
idioms with the same verb head, being the extractee of an infinitival complement, 
controlling a depictive predicate. Although some of these diagnostics have been 
shown to be disputable (Thomas 2012), they still provide an excellent starting 
point that we will adapt to the needs of Modern Greek. Modern Greek hardly 
uses any verb+particle constructs and has no infinitivals. Of the remaining 
diagnostics, lexical subcategorisation, heavy NP shift, animacy and control of a 
depictive predicate do not apply to MWEs that have fixed structures and 
non-compositional semantics. The idiom-based diagnostic is left out because fixed 
expressions are idioms. Lastly, the extraction diagnostic will be used as a diagnostic 
of constituency. 

We will not use semantic roles as a diagnostic because of their inherent fuzzi- 
ness (Dowty 1990) and because MWEs have non-compositional semantics. LFG 
assumes that OBJ can bear any or no thematic role at all since expletives can also 
materialize objects. It is generally accepted that Modern Greek has no overt 
expletives (Kotzoglou 2001). OBJθ, on the other hand, has been restricted to “themes” 
(Bresnan & Moshi 1990). 

The NP that instantiates an OBJθ never turns up as the subject in passives (6) 
while the NP that instantiates an OBJ does (5), (7). 

The case diagnostic yields ambiguous results in Modern Greek because direct 
and indirect objects and a range of adjuncts denoting time and place are 
instantiated with accusative NPs: of the two accusative NPs in (8), the NP ένα γράμμα (ena 
γrama) ‘a letter’ functions as an object while the NP την Παρασκευή (tin 
Paraskievi) ‘on Friday’ is an adjunct that can be questioned with πότε (pote) ‘when’. 


(8) Θα γράψω ένα γράμμα στον Κώστα την Παρασκευή. 
    Θa γrapso ena γrama ston Kosta tin Paraskievi. 
    will write.1SG a letter.ACC to.the Kostas the Friday.ACC 

    ‘I will write a letter to Kostas on Friday.’ 
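
The filtering of Hudson's (1992) list described above can be made explicit with a small sketch. The diagnostic names are shortened paraphrases of the list in the text, and the variable names are assumptions introduced here; the reasons recorded for each exclusion are the ones the text gives.

```python
# The 11 diagnostics attributed to Hudson (1992), in the order listed in the text.
HUDSON_1992 = [
    "passivisation",
    "extraction",
    "placement after a particle",
    "heavy-NP shift",
    "accusative case",
    "lexical subcategorisation",
    "semantic role of the prototypical direct object",
    "animacy",
    "idioms with the same verb head",
    "extractee of an infinitival complement",
    "control of a depictive predicate",
]

# Diagnostics set aside as objecthood tests, with the reason stated in the text.
EXCLUDED = {
    "placement after a particle": "Modern Greek hardly uses verb+particle constructs",
    "extractee of an infinitival complement": "Modern Greek has no infinitivals",
    "lexical subcategorisation": "not applicable to fixed, non-compositional MWEs",
    "heavy-NP shift": "not applicable to fixed, non-compositional MWEs",
    "animacy": "not applicable to fixed, non-compositional MWEs",
    "control of a depictive predicate": "not applicable to fixed, non-compositional MWEs",
    "idioms with the same verb head": "fixed expressions are themselves idioms",
    "extraction": "reused as a constituency diagnostic instead",
    "semantic role of the prototypical direct object": "semantic roles are fuzzy; MWE semantics is non-compositional",
}

remaining = [d for d in HUDSON_1992 if d not in EXCLUDED]
print(remaining)
```

Of the two diagnostics left over, the text goes on to note that case yields ambiguous results in Modern Greek, which is why the collection is then enriched with pronominalisation-based tests.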


Other diagnostics found in the literature seem to be language specific (Shi- 
Ching 2008). One of them is the position of the object in the sentence. In Modern 
Greek, normally both OBJ and OBJθ follow the verb. Modern Greek is a language 
with relatively free word order. Adjuncts can appear anywhere in the sentence 
between constituents (the exact positions depend on the type of the adjunct). 

We will enrich our collection of diagnostics with various types of 
pronominalisation including relativisation (9), Who/What-questions (10), (11) and clitic 
replacement (12). Pronominalisation has been used as a constituency diagnostic 
(Radford 1988). In certain languages relativisation has been used as a diagnostic 
for distinguishing between OBJ and OBJθ: in Cantonese (Shi-Ching 2008), the 
OBJ of monotransitive verbs and the OBJθ in ditransitive constructions are 
relativised with a gap while the OBJ of ditransitive constructions is relativised with 
a resumptive pronoun. Modern Greek does not have similar pronominalisation 
phenomena but we will see that relativisation is of some interest. We will also 
use Which-questioning (10), which has been adopted by Shi-Ching (2008) in her 
discussion of OBJ/OBJθ in Cantonese and has been briefly discussed in Kay & 
Sag (2014), as well as clitic replacement (12). 


(9) Ο κώδικας των Ναζί τον οποίο έσπασε ο Άλαν Τιούρινγκ... 
    O koðikas ton Nazi ton opio espase o Alan Turing... 

    ‘The Nazi code that Alan Turing broke ...’ 


(10) Ποιον κώδικα έσπασε ο Άλαν Τιούρινγκ; 
     Pion koðika espase o Alan Turing? 

     ‘Which code did Alan Turing break?’ 


(11) Τι έσπασε ο Άλαν Τιούρινγκ; 
     Ti espase o Alan Turing? 

     ‘What did Alan Turing break?’ 


(12) Τον έσπασε ο Άλαν Τιούρινγκ. 
     Ton espase o Alan Turing. 
     him broke.3SG the Alan.NOM Turing.NOM 

     ‘Alan Turing broke it.’ 


We will adopt the standard assumption that Modern Greek OBJ/OBJθ are 
materialized as phrasal constituents when they are not materialized by weak 
pronouns (clitics). Modern Greek widely uses pre-verbal clitics, which have been 
analysed both as NPs and as affixes (Joseph 1989). We do not think that the 
phrasal status of clitics bears on the issues examined here. 


2 Multiwords 


Word order permutations, adverb placement and control phenomena indicate the 
presence of phrasal constituents in Modern Greek MWEs. Drawing on Kay & 
Sag (2014) and Samaridi & Markantonatou (2014), we assume that Modern Greek 
free subject verb MWEs contain an idiomatic verb predicate that selects for a free 
subject and a number (including zero) of (possibly) idiomatic complements. 


2.1 Constituency diagnostics 


Radford (1988) mentions preposing, postposing and adverb interpolation as 
distributional diagnostics of phrasal constituents. We will use the term WORD ORDER 
PERMUTATIONS to collectively refer to preposing and postposing. 

Because we are working with MWEs that contain postverbal NPs (often of 
some complexity), we note that in Modern Greek, postnominal genitive NPs or 
weak pronouns denoting possession or some property and postnominal PPs 
cannot be extracted from the matrix NP (13b), (14b). The matrix NP¹ participates in 
word order permutations (13c), (14c). 


(13) a. Ο Γιάννης φοράει τα παπούτσια του Γιώργου. 
        O Γianis forai [ta paputsia tu Γiorγu]. 
        the John wears the shoes the.GEN George.GEN 
        ‘John wears George's shoes.’ 
     b. * Του Γιώργου φοράει ο Γιάννης τα παπούτσια. 
        Tu Γiorγu forai o Γianis ta paputsia. 
     c. Τα παπούτσια του Γιώργου φοράει ο Γιάννης. 
        [Ta paputsia tu Γiorγu] forai o Γianis. 


(14) a. Η Ελένη αγόρασε ένα ταψί για γλυκά. 
        I Eleni aγorase [ena tapsi γia γlika]. 
        the Eleni bought a tin for cakes 
        ‘Eleni bought a tin for cakes.’ 
     b. * Για γλυκά αγόρασε η Ελένη ένα ταψί. 
        Γia γlika aγorase i Eleni ena tapsi. 
     c. Ένα ταψί για γλυκά αγόρασε η Ελένη. 
        [Ena tapsi γia γlika] aγorase i Eleni. 


Furthermore, a temporal adverb may occur between the verb and its NP com- 
plement (15a), (16a) but it cannot occur within the NP (15b), (16b): 


(15) a. Ο Γιάννης φόρεσε χθες τα παπούτσια του Γιώργου. 
        O Γianis forese xθes [ta paputsia tu Γiorγu]. 
        the John wore yesterday the shoes the George.GEN 
        ‘John wore George's shoes yesterday.’ 
     b. * Ο Γιάννης φόρεσε τα παπούτσια χθες του Γιώργου. 
        O Γianis forese ta paputsia xθes tu Γiorγu. 


(16) a. Η Ελένη αγόρασε χθες ένα ταψί για γλυκά. 
        I Eleni aγorase xθes [ena tapsi γia γlika]. 
        the Eleni bought yesterday a tin for cakes 
        ‘Eleni bought a tin for cakes yesterday.’ 
     b. * Η Ελένη αγόρασε ένα ταψί χθες για γλυκά. 
        I Eleni aγorase ena tapsi xθes γia γlika. 


¹ The matrix NP is placed in brackets “[ ]” in the examples (13)-(16). 

Radford (1988) notes that pronouns such as ‘what’ can be used to question NP 
constituents irrespectively of their syntactic function, namely whether they are 
subjects (17), objects (18) or complements of prepositions (19), as well as a range 
of sentential complements. 


(17) Τι ήρθε το πρωί; Το τραίνο. 
     Ti irθe to proi? To treno. 
     what.NOM came the morning the train 

     ‘What came in the morning? The train did.’ 


(18) Τι φοράει ο Γιάννης; Τα παπούτσια του. 
     Ti forai o Γianis? Ta paputsia tu. 
     what.ACC wears the John.NOM the shoes his 

     ‘What does John wear? His shoes.’ 


(19) Από τι κρύωσε η Ελένη; Από τον αέρα. 
     Apo ti kriose i Eleni? Apo ton aera. 
     from what caught.cold the Eleni.NOM from the wind 

     ‘What gave a cold to Eleni? The wind.’ 


We will use these diagnostics to identify phrasal constituents in MWEs. 


2.2 MWE constituents 


Below we will use two types of verb MWE that admit a free subject (not a fixed 
one): 


1. The first type is represented with the verb MWE (20) and contains an 
accusative NP that is an independent nominal MWE. We know that it is 
independent because it can combine with several verbs and it is synonymous 
with the noun permission. We will use the label NP_MWE to refer to this 
type of nominal MWE. 

(20) Έδωσε το πράσινο φως για το Erasmus+. 
     Eðose to prasino fos γia to Erasmus+. 
     gave the green.ACC light.ACC for the Erasmus+ 

     ‘S/He gave the green light for Erasmus+.’ 


2. The second type contains fixed accusative NPs that do not form 
independent NP MWEs. We will use the label Fixed_NP to denote this type of 
NP, which is represented here with three verb MWEs admitting a free 
subject. Two of them involve the Fixed_NP τα μούτρα POSS (ta mutra POSS) 
where the obligatory POSS anaphor is bound by the subject (22), (23). The 
noun μούτρα (mutra) ‘face’ is a colloquial word (21). Within the MWEs, 
the Fixed_NP τα μούτρα POSS (ta mutra POSS) does not have the meaning 
‘POSS face’. 

(21) Πλύνε τα μούτρα σου που είναι μες τη βρώμα. 
     Pline ta mutra su pu ine mes ti vroma. 
     wash.IMP the face.ACC yours.GEN that is in the dirt 

     ‘Wash your face that is very dirty.’ 


(22) Ρίχνω τα μούτρα μου. 
     Rixno ta mutra mu. 
     drop.1SG the face.ACC mine.GEN 

     ‘I suppress my dignity.’ 


(23) Κοιτώ τα μούτρα μου. 
     Kito ta mutra mu. 
     look.1SG the face.ACC mine.GEN 

     ‘I look at myself.’ 


Word order permutations (24a)-(24b), adverb interpolation (25a)-(25b) and 
What-questioning (26a)-(26b) establish that the NP τα μούτρα POSS (ta mutra POSS) is 
a constituent of the respective MWEs: 


(24) a. Τα μούτρα σου να ρίξεις. 
        Ta mutra su na riksis. 
        the face.ACC yours.GEN to drop.2SG 
        ‘It is your dignity that you should suppress.’ 
     b. Τα μούτρα σου κοίτα. 
        Ta mutra su kita. 
        the face.ACC yours.GEN look.2SG.IMP 
        ‘Look at yourself.’ 


(25) a. Ο Γιάννης έριξε τότε τα μούτρα του. 
        O Γianis erikse tote ta mutra tu. 
        the John dropped then the face his 
        ‘Then John suppressed his dignity.’ 
     b. Η Ελένη κοίταξε τότε τα μούτρα της. 
        I Eleni kitakse tote ta mutra tis. 
        the Eleni looked then the face hers 
        ‘Eleni looked at herself for once.’ 


(26) a. Έριξε τότε τα μούτρα του. Τι έριξε; 
        Erikse tote ta mutra tu. Ti erikse? 
        dropped then the face his what dropped 
        ‘He suppressed his dignity for once. What did he do?’ 
     b. Η Ελένη κοίταξε τα μούτρα της. Τι κοίταξε; 
        I Eleni kitakse ta mutra tis. Ti kitakse? 
        the Eleni looked the face hers what looked 
        ‘Eleni looked at herself. What did she do?’ 


3 OBJ, OBJθ: Syntactic reflexes 


3.1 Objecthood diagnostics and the Fixed_NP 


Constituency diagnostics seem to set apart structures with an NP_MWE from 
structures with a Fixed_NP. 

The passivisation diagnostic returns a range of results: (20) has a passive 
counterpart (27a) but (23) and (24) do not (examples (27b) and (27c) respectively): 


(27) a. Δόθηκε το πράσινο φως για τη δόση. 
        Δoθikie to prasino fos γia ti ðosi. 
        was.given the green.NOM light.NOM for the instalment 
        ‘Permission for the instalment was given.’ 
     b. * Τα μούτρα μου κοιτάχτηκαν (από εμένα). 
        Ta mutra mu kitaxtikan (apo emena). 
        the face mine was.looked.at by me 
        ‘I looked at myself.’ 
     c. * Να ριχτούν τα μούτρα σου (από εσένα). 
        Na rixtun ta mutra su (apo esena). 
        to be.dropped the face yours by you 
        ‘It is your dignity that you should suppress.’ 


The relativisation diagnostic yields similar results: (20) does not block relative 
clauses targeting the NP_MWE (28) while (22) and (23) block relative clauses 
with the Fixed_NP as a target (29). 


(28) Το πράσινο φως το οποίο έδωσε η ΕΕ στους αγρότες 
     to prasino fos to opio eðose i EE stus aγrotes 
     the green light the that gave the EU to.the farmers 

     ‘the green light that the EU gave to the farmers’ 


(29) * Τα μούτρα σου, που έριξες τότε, να τα ξαναρίξεις. 
     Ta mutra su, pu erikses tote, na ta ksanariksis. 
     the face yours that dropped.2SG then to them re.drop.2SG 

     ‘You suppressed your dignity then and you should suppress it again.’ 


The Which-questions diagnostic returns similar results: NP_MWEs (20) allow 
for which-questions (30) but Fixed_NPs (22), (23) do not (31). 


(30) ? Ποιο πράσινο φως έδωσε η Ευρωπαϊκή Ένωση; 
     Pio prasino fos eðose i Evropaiki Enosi? 
     which green light gave the European Union 

     ‘Which permission did the EU give?’ 


(31) is a piece of dialogue that was evaluated by 6 native speakers who were 
instructed to choose one of the following three labels: “joke”, “description of an 
event”, “other”. All speakers chose the label “joke”. The joke, irony or pun effects 
seem to be due to the fact that the question ποιο χέρι (pio xieri) ‘which hand’ is 
unexpected in the context of the MWE. The MWE does not imply that someone 
actually put his/her hand in the fire while the question ποιο χέρι (pio xieri) shifts 
discourse to the literal meaning of χέρι (xieri) ‘hand’. Raskin (1985) argues that 
jokes arise from the violation of the Gricean conversational maxims that require 
information-bearing and serious and sincere communication. 


(31) Βάζω το χέρι μου στη φωτιά ότι ο Κώστας ζει. Ποιο χέρι; 
     Vazo to xieri mu sti fotia oti o Kostas zi. Pio xieri? 
     put the hand my in.the fire that the Kostas lives which hand 

     ‘I am absolutely sure that Kostas is alive. Which hand?’ 


The replacement with a clitic in discourse with the same MWE produces an 
interesting effect: as expected, (20) allows for cliticisation of the NP_MWE within 
the same expression (32); however, definite Fixed_NPs also allow for cliticisation 
with the same MWE (33): 


(32) Έδωσε το πράσινο φως για το Erasmus+; Ναι, το έδωσε. 
     Eðose to prasino fos γia to Erasmus+? Ne, to eðose. 
     gave the green light for the Erasmus+? yes, it gave 

     ‘Did s/he give the green light for Erasmus+? Yes, s/he did.’ 


(33) was also evaluated by 6 speakers who were instructed to choose one of the 
following three labels: “joke”, “description of an event”, “other”. They all chose 
the label “description of an event”. Therefore, the clitic τα (ta) ‘them’ can be used 
to replace objects in the context of the same MWE. 


(33) Θα ρίξω τα μούτρα μου. Εγώ δεν τα ρίχνω. 
     Θa rikso ta mutra mu. Eγo ðen ta rixno. 
     will drop the face.PLᵢ mine I not themᵢ drop 

     ‘I will suppress my dignity. I will not!’ 


Tsimpli & Mastropavlou (2007), following work by Cardinaletti & Starke (1999) 
and Tsimpli & Stavrakaki (1999), argue that Modern Greek third person clitics are 
"clusters of agreement and case features" and that they lack a referential index, a 
fact that explains their need for an antecedent. We can safely assume that 
cross-reference across occurrences of the same MWE satisfies agreement and case features and makes 
sure that semantics is identical across structures. 

Indefinite Fixed_NPs cannot be replaced by a clitic even in the context of the 
same MWE (35). Compositional structures (34) allow for clitic replacement of 
indefinite objects, even across different predications. 


(34) Ο Γιώργος έταξε στην Ελένη διακοπές. Τις σχεδιάζει καιρό. 
     O Γiorγos etakse stin Eleni ðiakopes. Tis sxieðiazi kiero. 
     the George promised to.the Eleni holidays them plans time 
     ‘George has promised a holiday to Eleni. He has been planning it for 
     some time.’ 

(35) to promise hares with stoles ‘to make unrealistic promises’ 

     Έταζε λαγούς με πετραχήλια. * Τους έταζε παντού. 
     Etaze laγus me petraxilia. * Tus etaze pantu. 
     promised haresᵢ with stoles themᵢ promised everywhere 

     ‘He made unrealistic promises. He made these promises to everyone.’ 


Ariel (2001), in the context of Accessibility Theory, argues that "referring 
expressions code a specific and (different) degree of mental accessibility" where 
"mental accessibility" is meant as a shorthand for "accessibility of mental 
representations that are available to the addressee in the discourse". Referential 
expressions are accessibility markers guiding the addressee in how to retrieve appropriate 
mental representations. Drawing on distributional findings, Ariel suggests an 
ordering of referential expressions from low to high accessibility markers. On this 
ordering, definite expressions are situated on the edge of low accessibility 
marking and 3rd person clitics on the edge of high accessibility marking. This means 
that the addressee perceives definiteness as a signal that an entity has just been 
introduced to the discourse and the existence of a clitic as a signal that she has 
to look for an entity that has been introduced to the discourse some time ago. 
Therefore, definiteness should "attract", so to say, clitics. Perhaps definiteness is 
the reason why (only) definite Fixed_NPs can be replaced with a clitic. The reader 
should keep in mind that replacement of a Fixed_NP with a clitic is allowed only 
in the strict context of the same MWE and that indefinite Fixed_NPs cannot be 
replaced (35). 

Lastly, discourse collapses if cross-reference is required across different MWEs 
(36) and across MWEs and compositional structures (37) (compositional struc- 
tures allow for cross-reference across different predications). (36) and (37) below 
sound absurd. At best, (37) produces a joke/irony effect - an effect that was ob- 
served with Which-questions as well. 


(36) *Ο Πέτρος  έριξε   τα  μούτρα    του και μετά τα     κοίταξε.
      O Petros  erikse  ta  mutra     tu  kie meta ta     kitakse.
      the Petros dropped the face.PL_i his and then them_i looked

     ‘Petros suppressed his dignity and then he looked at himself.’


(37) *Έριξα   τα  μούτρα    μου.  Τα     είχα καλύψει πριν.
      Eriksa  ta  mutra     mu.   Ta     ixa  kalipsi prin.
      dropped the face.PL_i mine. them_i had  covered before

     ‘I suppressed my dignity. I had covered my face in advance.’



English MWEs present a picture similar to the Modern Greek one. Kay & Sag
(2014) discuss the case of the English verb MWE to kick the bucket and apply
similar diagnostics. The MWE to kick the bucket resists passivisation. Furthermore,
relativisation, Which-questioning and replacement of the bucket with it² are not
possible (38a)-(38c).


(38) a. * the bucket that the peasant kicked ... 
b. * Which bucket did the peasant kick? 
c. The peasant kicked the bucket. * Also, his wife kicked it. 


3.2 Application of objecthood diagnostics on OBJθ


The accusative NP την ελληνική ιστορία (tin eliniki istoria) ‘the Greek history’ in
(39) instantiates an OBJ and responds positively to all constituency diagnostics.³
In (40) the definite NP την ελληνική ιστορία (tin eliniki istoria) instantiates an
OBJθ.


(39) Ο   Πέτρος διδάσκει στην   κοπέλα την ελληνική  ιστορία.
     O   Petros didaski  stin   kopela tin eliniki   istoria.
     the Petros teaches  to.the girl   the Greek.ACC history.ACC

     ‘Petros teaches the Greek history to the girl.’


(40) Ο   Πέτρος διδάσκει την κοπέλα   την ελληνική  ιστορία.
     O   Petros didaski  tin kopela   tin eliniki   istoria.
     the Peter  teaches  the girl.ACC the Greek.ACC history.ACC

     ‘Peter teaches the girl the Greek history.’


We have already illustrated with examples (5)-(7) that the Modern Greek OBJθ
patterns with the English OBJθ as regards passivisation.

Relativisation is somewhat unwelcome with an OBJθ: (41a) and (41b) were accepted
as grammatical by 50% of the speakers.


(41) a. η   κοπέλα   που διδάσκει ο   Πέτρος την ελληνική ιστορία
        i   kopela   pu  didaski  o   Petros tin eliniki  istoria
        the girl.NOM who teaches  the Petros the Greek    history.ACC

        ‘the girl to whom Petros teaches the Greek history’

     b. η   ελληνική ιστορία     που  διδάσκει ο   Πέτρος την κοπέλα
        i   eliniki  istoria     pu   didaski  o   Petros ti  kopela
        the Greek    history.NOM that teaches  the Petros the girl.ACC

        ‘the Greek history that Petros teaches to the girl’


²It is the nearest English equivalent of Modern Greek clitics.
³However, it must be noted that 5 out of the 7 speakers who commented on (39) and especially
(40) thought them acceptable but somewhat clumsy.


The Which-questions diagnostic returns a variety of results: (42a) was rejected
by all the speakers while (42b) was accepted as grammatical by 50% of the
speakers.


(42) a. *Ποια  κοπέλα διδάσκει ο   Πέτρος την ελληνική ιστορία;
         Pia   kopela didaski  o   Petros tin eliniki  istoria?
         which girl   teaches  the Petros the Greek    history.ACC

     b. Ποια  ιστορία διδάσκει ο   Πέτρος την κοπέλα;
        Pia   istoria didaski  o   Petros tin kopela?
        which history teaches  the Petros the girl?


While OBJ can be replaced with a clitic (43a), replacement of OBJθ with a clitic
is not possible in discourse with the same predication (43b).


(43) a. Ο   Πέτρος την         διδάσκει την ελληνική ιστορία.
        O   Petros tin(‘girl’) didaski  tin eliniki  istoria.
        the Petros her         teaches  the Greek    history
        ‘Petros teaches her the Greek history.’

     b. *Ο   Πέτρος την            διδάσκει την κοπέλα.
         O   Petros tin(‘history’) didaski  tin kopela.
         the Petros it             teaches  the girl


Replacement of an OBJ with a clitic is possible in a discourse with a different
predication. In (44), the clitic την (tin) ‘her’ may refer to either an NP instantiating
an OBJ (την Μαρία (tin Maria) ‘Maria’) or to the complement of a P (στην Μαρία
(stin Maria) ‘to Maria’). Furthermore, the clitic την (tin) ‘her’ in the second clause
refers to the NP την ελληνική ιστορία (tin eliniki istoria) ‘the Greek history’ that
instantiates the OBJθ.

(44) Ο   Πέτρος διδάσκει στη    Μαρία / τη  Μαρία την ελληνική ιστορία.
     O   Petros didaski  sti    Maria / ti  Maria tin eliniki  istoria.
     the Petros teaches  to.the Maria / the Maria the Greek    history
     Την έχει κάνει να την αγαπήσει.
     Tin exi  kani  na tin ayapisi.
     her has  made  to it  like
     ‘Petros teaches Maria the Greek history. He has made her love it.’

Similar results are obtained if the same diagnostics are applied to the English OBJθ
(Thomas 2012): the English OBJθ cannot be replaced by it (45).


(45) * John gave Mary it. 


3.3 The overall syntactic behavior of OBJ, OBJθ and of the (yet
unknown) GF assigned to Fixed NP


The results of the application of the diagnostics on the GF assigned to Fixed NP,
OBJ, OBJθ and ADJ instantiated with accusative NPs, including optionality, case
marking and position in the sentence, are summarized in Table 1. We have not
provided detailed data for the application of the diagnostics on ADJ.

Direct objects can be optional in Modern Greek (Anastasopoulos et al. 2013).
Kordoni (2004) presents Modern Greek data where OBJθ is omitted. MWEs, on
the other hand, hardly allow for constituent omission.


Table 1: The overall syntactic behavior of OBJ, OBJθ, ADJ, and the GF
assigned to F(ixed) NP according to the objecthood diagnostics.


Phenomenon             FNP   FNP   OBJ   OBJθ  OBJθ  NP adj
Language               EL    EN    EL    EL    EN    EL
Optionality            N     N     Y     Y     N     Y
Relativisation         N     N     Y     ?Y    Y     Y
Which-questions        N     N     Y     ?Y    Y     Y
Clitic-same MWE        Y     N*    Y     N     N*    N
Clitic-different MWE   N     N*    Y     Y     N*    N
Clitic-compositional   N     N*    Y     Y     N*    N
Accusative Postverbal  Y     Y     Y     Y     Y     Y/N
Passivisation          N     N     Ys    N     N     N


Clarifications on Table 1:

1. F NP: it stands for Fixed NP.
2. N*: English has no clitics. We refer to the usage of the pronoun it - see (38c) and (45).
3. Ys: Not all transitive verbs have passive counterparts in Modern Greek.
4. ?Y: Speakers' responses were not unanimous.
5. Y/N: Modern Greek accusative NP adjuncts can appear in both pre- and post-verbal
   positions.
The feature “accusative postverbal” takes the same value for all the examined
categories and has no discriminating role; therefore it will not be taken into
account in the remainder of this discussion. Furthermore, ADJ, OBJ and OBJθ
respond positively to relativisation and Which-questions, indicating that the two
diagnostics are sensitive to the semantics of the NPs rather than their syntactic
function (Kay & Sag 2014). These diagnostics will not be used as objecthood
diagnostics for Modern Greek or English.

A more detailed picture of the situation with passivisation in our collection of 
Modern Greek verb MWEs is given in the next section. 


4 A more detailed picture of passivisation in Modern 
Greek MWEs 


Out of a collection of 1120 verb MWEs,⁴ 57,5% are formed with verbs that have a
passive counterpart. The remaining 42,5% are formed with verbs that have no
passive counterpart. Of the MWEs that are formed with verbs that have a passive
counterpart in the general language, only 53 have a passive MWE counterpart.
Among the passivisable MWEs, 24 contain a free accusative NP that becomes the
subject of the passive form (46), 6 contain an NP MWE (27) and 23 contain a
Fixed NP. Of the MWEs that are formed with passivisable verbs but do not have a
passive MWE counterpart, 76 contain a free accusative NP, 24 contain an
accusative NP MWE and 221 contain a Fixed NP. Percentages in Table 2 are
calculated over the whole data set (1120 MWEs).
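
The percentages reported for this collection can be recomputed from the raw counts. The following is a minimal sketch in Python that assumes only the counts quoted in the paragraph above; the names `TOTAL_MWES`, `counts` and `share` are illustrative and are not part of the original study.

```python
# Counts as quoted in the text; the labels are illustrative, not the authors' own.
TOTAL_MWES = 1120

counts = {
    "passivisable MWEs": 53,
    "passivisable, free NP": 24,
    "passivisable, NP MWE": 6,
    "passivisable, Fixed NP": 23,
    "non-passivisable, free NP": 76,
    "non-passivisable, NP MWE": 24,
    "non-passivisable, Fixed NP": 221,
}


def share(count, total=TOTAL_MWES):
    """Return count as a percentage of the whole data set, to two decimals."""
    return round(100 * count / total, 2)


shares = {label: share(n) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Computed this way, the three complement types of the passivisable MWEs add up to the 53 passivisable MWEs, and every share is taken over the whole data set rather than over a subgroup.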


(46) Ο   όρος κοινότητα ... αφέθηκε  στην   ιστορική   ησυχία του.
     O   oros kinotita  ... afeθike  stin   istoriki   isixia tu.
     the term community ... was-left to.the historical peace  its

     ‘The term community was left alone in its historical peace.’
     http://commonsfest.info/2015/i-istoria-ton-kinon-ston-elliniko-choro/


Several of the passivisable MWEs contain Fixed NP whose head nouns seem
to instantiate senses different from the nouns’ literal ones. For instance, the noun
μέτρα (metra) ‘meters’ is used with the sense ‘measures’ in (47). Such senses are
used widely in compositional structures. Along with idioms, the collection used
also includes collocations.


⁴http://users.sch.gr/samaridi/attachments/article/3/LexicalResources.pdf

Table 2: Passives in the dataset of Modern Greek free subject verb MWEs


Verbs       Total        MWE         Total        Complement  Total
passive     644 (57,5%)  passive     53 (4,7%)    Free NP     24 (2,1%)
                                                  NP MWE      6 (0,54%)
                                                  Fixed NP    23 (2%)
                         no passive  591 (52,7%)  Free NP     76 (6,8%)
                                                  NP MWE      24 (2,1%)
                                                  Fixed NP    221 (19,7%)
no passive  426 (42,5%)
Total       1120


(47) Αυτά  είναι τα  μέτρα    που  κατέθεσε  η   ελληνική κυβέρνηση.
     Afta  ine   ta  metra    pu   kateθese  i   eliniki  kivernisi.
     these are   the measures that submitted the Greek    government

     ‘These are the measures that the Greek government submitted.’


If these collocations are put aside, only 1% corresponds to passivisable
MWEs with a Fixed NP. In (48) the Fixed NP μεγάλα λόγια (meyala
loyia) ‘big words’ is the subject of the passive form of the MWE λέω μεγάλα
λόγια (leo meyala loyia) ‘to make big promises’.


(48) Είναι σύνηθες να λέγονται μεγάλα λόγια από μικρούς πολιτικούς.
     Ine   siniθes na leyonte  meyala loyia apo mikrus  politikus.
     is    common  to say.PASS big    words by  small   politicians

     ‘Often unimportant politicians make big promises.’


The collection we have used is of medium size but clearly shows that Modern
Greek MWEs disfavor passivisation: passivisable MWEs (both fixed ones and
collocations) account for only 4,7% of the total number of MWEs.


5 OBJ, OBJθ or some new GF?


We now turn to our main question, namely whether OBJ or OBJθ can be
assigned to Fixed NP or whether a new GF (LFG) should be defined. In what
follows we will use the collective term “meaning preserving NPs” for Fixed NP
with heads with independent, non-literal senses, accusative NP MWE and, of
course, for free accusative NPs. The picture that has emerged so far reveals three
groups of verb MWE:

Group 1: The group of passivisable verb MWEs that contain meaning preserv- 
ing NPs and satisfy objecthood diagnostics; it comprises the majority of passivis- 
able Modern Greek MWEs. 

Group 2: The group of non passivisable verb MWEs containing both meaning 
preserving NPs and Fixed NP. 

Group 3: The rather small group (1%) of passivisable verb MWEs that contain
Fixed NP.

We can safely say that Group 1 contains verb MWEs whose verbal head selects
for an OBJ because all objecthood diagnostics are satisfied. In LFG, passivisation
is modeled with a lexical rule that takes as input an active transitive predicate
and maps the active OBJ on the SUBJ of the output passive predicate and the
active SUBJ on an adjunct of the passive predicate. We assume that the LFG
lexical rule for passivisation that requires an OBJ applies normally on these MWEs.
Furthermore, an OBJ function can be assigned to passivisable verb MWEs with a
Fixed NP that constitute Group 3; the set of such verb MWEs is very small and it
will be harmless to consider them as idiosyncratic (further research might reveal
interesting aspects of these Fixed NP).
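
The effect of the passivisation lexical rule described above can be sketched as follows. This is a simplified illustration in Python, not an LFG implementation: the `Predicate` class, the `passivise` function and the sample entries are hypothetical, and GFs are reduced to plain labels.

```python
from dataclasses import dataclass

# Active GF -> passive GF, as stated in the text: OBJ becomes SUBJ,
# SUBJ is demoted to an adjunct. All other GFs are left untouched.
PASSIVE_MAPPING = {"OBJ": "SUBJ", "SUBJ": "ADJ"}


@dataclass(frozen=True)
class Predicate:
    lemma: str
    gfs: tuple            # grammatical functions selected by the predicate
    voice: str = "active"


def passivise(pred):
    """Apply the rule to an active predicate that selects SUBJ and OBJ.

    Returns None when the input does not meet the rule's requirements,
    which is how the sketch models 'the rule cannot be applied'.
    """
    if pred.voice != "active" or "SUBJ" not in pred.gfs or "OBJ" not in pred.gfs:
        return None
    new_gfs = tuple(PASSIVE_MAPPING.get(gf, gf) for gf in pred.gfs)
    return Predicate(pred.lemma, new_gfs, voice="passive")


# Hypothetical entries, for illustration only.
transitive = Predicate("teach", ("SUBJ", "OBJ"))
intransitive = Predicate("sleep", ("SUBJ",))

print(passivise(transitive))    # passive predicate with GFs ('ADJ', 'SUBJ')
print(passivise(intransitive))  # None: no OBJ, so the rule does not apply
```

Because the rule is stated over the OBJ label alone, any predicate that does not select an OBJ is automatically outside its domain.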

Group 2 comprises verb MWEs that do not passivise but contain both meaning 
preserving NPs that satisfy objecthood diagnostics except for passivisation, and 
Fixed NP that satisfy only clitic replacement in the same MWE context provided 
they are definite. 

Kay & Sag (2014) discuss a similar distribution of English MWEs. In order to
model the dichotomy introduced by passivisable versus non-passivisable MWEs,
they split verbs into real transitive and pseudo-transitive ones.⁵ Real transitive
verbs correspond to Group 1 above. The class of pseudo-transitive verbs of Kay
and Sag includes verbs of measurement such as cost, weigh, measure and MWEs
with Fixed NP such as to kick the bucket; therefore pseudo-transitive verbs can
be considered a superset of Group 2. By definition then, pseudo-transitive verbs
do not select real objects and therefore they do not passivise. Furthermore, Kay and
Sag observe that (like Modern Greek MWEs) several English MWEs with fixed
NPs fail the relativisation and Which-question objecthood diagnostics; however,
they note that the failure can be explained by semantic or pragmatic constraints
on the diagnostics. Passivisation cannot be considered a semantics-sensitive
diagnostic because expletives and Fixed NP turn up as subjects of passivised MWEs.
Therefore, the proposed splitting of verbs into transitive and pseudo-transitive
ones draws solely on the ability to passivise, and membership in each of the two
groups is a lexical property of the verb.


⁵In the revised version of the manuscript http://www1.icsi.berkeley.edu/~kay/idiom-pdflatex.11-13-15.pdf
the transitive/pseudo-transitive dichotomy has been replaced with the distinction
between meaningful and meaningless idiomatic complements of idiomatic verb predicates, the
assumption being that passivisation applies on meaningful objects. Of course, in compositional
language there are several verbs that accept meaningful objects and still do not passivise while
expletives do turn up as subjects of passive verbs.

The Kay & Sag (2014) approach that we have discussed so far relies on the
verb predicate in order to explain the non-uniform behavior of “objects”. Doug
Arnold (University of Essex, personal communication) has suggested an
alternative approach, namely that the Fixed NP could be blamed for the scarcity of
MWE passives. The two approaches, the verb predicate oriented and the Fixed NP
oriented one, can be transcribed in LFG in one of the four ways below:


1. (verb predicate oriented): Some feature of the type +/-PASSIVISES is defined
   in the lexical entry of the verb and the OBJ GF is assigned to Fixed NP.


2. (verb predicate oriented): The verb does not select an OBJ; rather it selects
   some other GF and this is why the passivisation lexical rule that requires
   an OBJ cannot be applied.


3. (Fixed NP oriented): The head of the Fixed NP is associated with the
   inside-out constraint (OBJ ↑) in the lexicon (Doug Arnold's proposal); the result
   of the constraint is that the Fixed NP is able to realise only the OBJ GF and
   no other GF.


4. (Fixed NP oriented): The case of the Fixed NP is fixed to ACC (accusative).


Hypotheses 3 and 4 seem to be equivalent in the case of Modern Greek and
English where subjects of main clauses are marked with the nominative case.
As a result, an NP inherently marked as ACC cannot instantiate a SUBJ GF.
Consequently, this NP cannot participate in alternations that result in a change of
case, such as passivisation and causative-inchoative alternation. The inside-out
constraint (OBJ ↑) of hypothesis 3 has the same effect. However, there are
passivisable verbs in Modern Greek that head non-passivisable MWEs with a
non-causative counterpart where the Fixed NP is the subject. For instance, the MWE
ανάβω τα λαμπάκια κάποιου (anavo ta labakia kapiu) ‘I make somebody angry’
does not have a passive counterpart (49a) although it is headed by a causative
verb that has a passive counterpart in compositional language. However, the
expression has a non-causative counterpart (49b) where the Fixed NP τα λαμπάκια
(ta labakia) turns up as a subject in the nominative case.


(49) a. *Ανάφτηκαν    τα  λαμπάκια   του     Πέτρου     από εμένα.
         Anaftikan    ta  labakia    tu      Petru      apo emena.
         turn.on.PASS the lights.ACC the.GEN Petros.GEN by  me

        ‘I made Petros angry.’


     b. Άναψαν      τα  λαμπάκια   του     Πέτρου.
        Anapsan     ta  labakia    tu      Petru.
        turn.on.ACT the lights.NOM the.GEN Petros.GEN

        ‘Petros got angry.’


In addition, there are causative/non-causative MWE pairs that are headed by
different verbs such as the causative MWE (50a) and its non-causative
counterpart (50b). Such examples suggest that the hypothetical constraint (OBJ ↑)
originates from the causative form of the verb and not from the Fixed NP.
Furthermore, the use of Fixed NP in titles as illustrated with example (51b),⁶ in
particular the use of Fixed NP that feature in verb MWEs that have no non-causative
counterpart (51a), suggests that the Fixed NP oriented approach should be
abandoned.


(50) a. Ρίχνω    τα  μούτρα   μου.
        Rixno    ta  mutra    mu.
        drop.1SG the face.ACC mine
        ‘I suppress my dignity.’

     b. Πέφτουν  τα  μούτρα   μου.
        Peftun   ta  mutra    mu.
        fall.3SG the face.NOM mine

        ‘My dignity is suppressed.’


(51) a. Πίνω  το  πικρό      ποτήρι.
        Pino  to  pikro      potiri.
        drink the bitter.ACC glass.ACC
        ‘I have a difficult time.’

     b. το  πικρό  ποτήρι, ο   Αλέξης     και ο   Κυριάκος
        to  pikro  potiri, o   Alexis     kie o   Kiriakos
        the bitter glass,  the Alexis.NOM and the Kiriakos.NOM

        ‘the difficult time, Alexis and Kiriakos’
        http://www.logiastarata.gr/2016/01/blog-post_194.html


⁶The conjunction in (51b) ensures the NOM case of the Fixed NP.


We now turn to the verb predicate oriented hypotheses. Hypothesis 2 suggests
that the verb assigns to the Fixed NP some GF other than the OBJ GF. It would
make sense to assume that Fixed NP instantiates OBJθ if Fixed NP occurred in
ditransitive constructions exclusively, but it occurs with a large variety of verbs.
In addition, OBJθ is restricted to themes; it would be risky to apply semantic
roles on the idiomatic meanings of Fixed NP and of verbs in MWEs. Furthermore,
OBJθ cannot be replaced with a clitic but it can be omitted (Kordoni 2004). For
all these reasons, the OBJθ GF is an unattractive hypothesis for Fixed NP.

Hypothesis 1 suggests that OBJ is assigned to Fixed NP and some feature of
the type +/-PASSIVISES is defined on the lexical entry of the verb. This is not a
semantic feature because a robust theory that attributes passivisation to verbal
semantics is not available yet. On the other hand, such a feature is needed
anyway in LFG, otherwise the passivisation lexical rule will apply to verbs like σπάω
(spao) ‘break’ (1) that select a SUBJ and an OBJ.

However, hypothesis 1 is less principled than a GF-based approach. Features
are dedicated to specific phenomena while GFs lend themselves to wider
generalisations; for instance, OBJθ has been used to encode the behavior of
ditransitives and applicatives cross-linguistically (Bresnan & Moshi 1990). In the case of
Fixed NP, apart from passivisation there is a need to encode two more facts that
do not characterise OBJ and cannot be stated as a property of non-passivisable
verbs: first, only Fixed NP introduced with a definite article can be replaced with
a clitic in Modern Greek while the English Fixed NP cannot be replaced with it,
and second, Fixed NP are obligatory in both languages.

In the light of the discussion above, one could be tempted to define a new GF 
that would be instantiated by Fixed NP. Let us call this GF FIX. The facts we have 
seen so far that favor the new GF approach, and would be the defining features 
of FIX, are the following: 


• Distributional/semantic: Fixed NP can be found only with MWEs


• No passivisation: Fixed NP do not appear as subjects of passive MWEs
  (very strong tendency)


• Replacement with a clitic: it is restricted to definite Fixed NP only


• Optionality: Fixed NP is hardly optional


• Cross-linguistic evidence: Similar behavior is observed in at least two
  languages, English and Modern Greek.
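
The defining features listed above can be read as a small set of predictions about a complement that instantiates FIX. The sketch below is an illustration in Python under a simplified reading of the list: the `FixedNP` class and the `fix_predictions` function are hypothetical, the "very strong tendency" against passivisation is flattened into a categorical value, and the clitic prediction concerns Modern Greek only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedNP:
    form: str        # transliterated fixed string
    definite: bool   # introduced with a definite article?


def fix_predictions(np, same_mwe_context=True):
    """Predicted behaviour of a Modern Greek complement bearing FIX.

    Simplification: passivisation is modelled as never possible, although
    the text reports a small group of exceptions (Group 3).
    """
    return {
        "subject_of_passive": False,
        "optional": False,
        # Clitic replacement requires definiteness and the same MWE context.
        "clitic_replacement": np.definite and same_mwe_context,
    }


ta_mutra = FixedNP("ta mutra", definite=True)             # cf. (50a)
layus = FixedNP("layus me petraxilia", definite=False)    # cf. (35)

print(fix_predictions(ta_mutra))
print(fix_predictions(layus))
print(fix_predictions(ta_mutra, same_mwe_context=False))  # cf. (36), (37)
```

Stating the predictions per GF rather than per verb is what distinguishes this view from a +/-PASSIVISES feature, which would say nothing about clitic replacement or optionality.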


We have already alluded to the fact that the combined effect of the OBJθ and
the proposed FIX is not enough to model the range of non-passivisable verbs.
FIX could be assigned to Fixed NP and, probably, to the objects of measurement
verbs as well as, generally, to verbs whose object cannot be assigned some clear
semantic role. However, it would seem awkward to lump the Modern Greek
typically transitive but non-passivisable change of state verbs like σπάω (spao) ‘break’
(1) together with MWEs and measurement verbs; change of state verbs clearly
assign the Proto-Patient semantic role to their objects while it is hard to pin down
the role that is assigned by measurement verbs and MWEs to the accusative NPs
that we discuss here. A clearly unwelcome feature of the GF approach is that it
leaves room for more object-like GFs that block passivisation and are selected by
rather specific types of predicate, given that OBJθ is selected by ditransitives and
applicatives and FIX by MWE verbal heads only. Certainly, it would be
preferable to keep the GF population small in size because GFs are primitive concepts
of LFG (Dalrymple 2001).

Despite the problems discussed above, we would opt for FIX, because it is 
more principled since it generalises over properties of English and Modern Greek 
MWEs. Below, we will attempt to support our preference with more facts drawn 
from Modern Greek MWEs. 


6 Words With Spaces and the FIX 


Fixed NPs comprise more complex phrasal structures than the ones we have seen
so far. These may be of the type DETERMINER+ADJECTIVE+NOUN (51), NP.GEN+
NOUN,⁷ or NOUN+NP.GEN or NOUN+PP (35). These MWEs do not passivise. The
Fixed NPs in (51) and (52) can be replaced with a clitic within the same predication
because the Fixed NP is introduced with the definite article, while the NP in (35) is not.


(52) Έφαγαν  τη  σκόνη    του Διαμαντίδη.
     Efayan  ti  skoni    tu  Diamantidi.
     ate.3PL the dust.ACC the Diamantidis.GEN

     ‘They were overtaken by Diamantidis.’


⁷NP.GEN+NOUN can be free or fixed; (50) exemplifies a free genitive NP.
In fact, a wider range of fixed strings behave as single complements of the
MWE verb (Samaridi & Markantonatou 2014). Here we will exemplify the idea 
with a predication structure. 

The compositional equivalent of the fixed string in (53a) is that of an object
that controls a predicative complement. The string το ψωμί ψωμάκι (to psomi
psomaki) (53a) is fixed because its parts cannot be separated (53b) and no free
XP can intervene (53c). At the same time, constituency diagnostics show that it
is a constituent ((53a): word order permutations, (53d): temporal adverb
interpolation) and can be questioned (53e). The fixed string is introduced with a definite
article and can be replaced with a clitic in the context of the same MWE (53f).
Therefore, το ψωμί ψωμάκι (to psomi psomaki) behaves like a Fixed NP.


(53) a. Λέμε     το   ψωμί   ψωμάκι.        / Το ψωμί ψωμάκι λέμε.
        Leme     [to  psomi  psomaki].      / [To psomi psomaki] leme.
        call.1PL the  bread  little.bread
        ‘We are starving.’

     b. *To psomi leme psomaki. / *Psomaki leme to psomi.

     c. *Λέμε το  γλυκό ψωμί  καημένο ψωμάκι.
         Leme to  yliko psomi kaimeno psomaki
         say  the sweet bread poor    little-bread

     d. Λέμε τώρα το  ψωμί  ψωμάκι.
        Leme tora to  psomi psomaki.
        call now  the bread little-bread

        ‘We are starving now.’

     e. Τι   λέμε      τώρα; Το  ψωμί  ψωμάκι.
        Ti   leme      tora? To  psomi psomaki.
        what do.we.say now?  the bread little-bread

     f. Λέμε      το  ψωμί  ψωμάκι;       Ναι, το λέμε.
        Leme      to  psomi psomaki?      Ne,  to leme.
        do.we.say the bread little-bread? yes, it we.say

        ‘Are we starving? Yes, we are.’


The fixed string το ψωμί ψωμάκι (to psomi psomaki) is a Word With Spaces
(WWS) (Sag et al. 2002) that satisfies constituency diagnostics. If το ψωμί ψωμάκι
(to psomi psomaki) is not treated as a WWS, additional constraints to block (53b)
would be needed. Similar ideas have been discussed in Green et al. (2013), where
the fixed parts of MWEs are represented as flat structures. In the examples above,
the idiomatic predicate λέω (leo) ‘call’ assigns the FIX GF. Lack of a passive
counterpart and clitic replacement follow from FIX normally.

To represent structures like (52), where a free genitive NP occurs as part of
the fixed structure of the MWE, the WWS τη_σκόνη (ti_skoni) selects for a POSS
Grammatical Function. The POSS function will allow for the representation of
binding phenomena that are often found with MWEs. For instance, (50a) is an
example of a MWE where the possessive pronoun that complements the WWS
τα μούτρα (ta mutra) is necessarily bound by the free subject of the idiomatic
verb.

In a nutshell, the FIX GF seems to be instantiated exclusively by phrases headed
by fixed strings, such as (53a), that may or may not be generated with the phrase
structure rules devised for compositional structures. Along with other work on
MWEs within the LFG framework (Attia 2006) we list fixed strings in the
lexicon. Treating WWSs as lexical entries deals with the problem of generating
non-compositional fixed strings while FIX captures passivisation and replacement
with a clitic.
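
Listing fixed strings in the lexicon can be pictured with a small lookup procedure. The following is a minimal sketch in Python, assuming a toy lexicon keyed by transliterated, underscore-joined strings; the entry format, the `segment` function and the greedy longest-match strategy are illustrative assumptions and do not reproduce any particular LFG implementation.

```python
# Toy lexicon: WWS entries are listed as single items; the attribute names
# ("cat", "definite", "selects") are hypothetical.
LEXICON = {
    "to_psomi_psomaki": {"cat": "WWS", "definite": True, "selects": ()},
    "ti_skoni": {"cat": "WWS", "definite": True, "selects": ("POSS",)},
    "ta_mutra": {"cat": "WWS", "definite": True, "selects": ("POSS",)},
}


def segment(tokens, lexicon=LEXICON):
    """Merge adjacent tokens that form a listed WWS (longest match first)."""
    max_len = max(len(key.split("_")) for key in lexicon)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = "_".join(tokens[i:i + n])
            if n > 1 and key in lexicon:
                out.append(key)   # the fixed string surfaces as one unit
                i += n
                break
        else:
            out.append(tokens[i])  # no listed WWS starts here
            i += 1
    return out


print(segment(["leme", "to", "psomi", "psomaki"]))   # cf. (53a)
print(segment(["to", "psomi", "leme", "psomaki"]))   # cf. (53b)
```

In the first call the fixed string is recognised as a single unit that could then be assigned FIX by the idiomatic verb; in the second, where the parts are separated as in (53b), no listed entry matches and no WWS is found, so no extra constraint is needed to exclude the discontinuous order.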


7 Conclusion 


We have argued that verbal MWEs that contain direct complements of verbs 
headed by fixed strings cannot be captured with exactly the same syntactic ma- 
chinery that has been developed for compositional structures. Despite appear- 
ances, fixed complements do not behave as direct or indirect objects with respect 
to a number of classical objecthood diagnostics. We argued that this special syn- 
tactic behaviour is identifiable at a syntactic functional level. If we are right, the 
syntactic apparatus that has been developed in LFG to represent the notion of 
“objecthood” in compositional structures has to be expanded to accommodate a 
new GF that we called FIX. The new GF is necessary for modeling a widespread
type of MWEs.

Certainly, several issues are left for future research: the range of syntactic
phenomena involving the strings that instantiate FIX (modification, alternations such
as those illustrated in (49b), (50b) and (51b), which pose questions concerning the
treatment of MWEs with a fixed subject), control phenomena and, probably, the
modeling of the switch from MWE to compositional contexts that gives rise to
joke/irony/pun effects, a phenomenon that might be modeled more easily in
terms of WWSs and FIX.
Abbreviations

GF     grammatical function
HPSG   Head-driven Phrase Structure Grammar
LFG    Lexical Functional Grammar
MWE    multiword expression
NLP    Natural Language Processing
NP     noun phrase
OBJ    object
OBJθ   OBJECTθ
POSS   possessive grammatical function
Multiword expressions and derivation have rarely been discussed together, even
though analyzing the interaction between them is of great importance for the study 
of each topic and, in general, for the study of the language and for Natural Lan- 
guage Processing. Derivation is a means of enriching the lexicon with both words 
and multiword expressions. Various types of derivation (suffixation, prefixation or 
both, as well as other derivational devices) can act upon either words or multiword 
expressions. The focus of our work here is the formation of multiword expressions 
from other multiword expressions via derivation. We analyze the morphological, 
syntactic and semantic aspects of this process, providing examples from Romanian
and Bulgarian, languages which belong to different families but have been in
contact throughout their history. The study can be further extended with data from 
other languages. The perspective adopted here is paradigmatic, but the syntagmatic 
approach, which can only be mentioned as further work, will add to the quality of 
the analysis of facts: corpus data will contextualize the phenomena discussed here 
and offer quantitative information about them. 


1 Introduction 


Widely accepted as a difficult task to deal with, the identification of multiword
expressions (MWEs) in processing natural languages becomes even more difficult
when the MWEs are new creations in the language or even ad-hoc creations in
the text as a result of the linguistic creativity of speakers, usually carrying an
emotional load (1):


(1) a băga de seamă - băgător de seamă (RO)


‘to pay attention to’ - ‘(the one) who only watches without playing any
role (in the action)’


In example (1) the latter MWE is derived from the former and carries a negative 
connotation. 

While the interest in the origin of MWEs has been manifested in all languages, 
specialists have normally investigated the social, economic, ethnographic, and 
other aspects motivating the process of turning certain word combinations into 
MWEs. When the origins cannot be found in the national background, MWEs 
are attributed to other languages, so they are borrowings or linguistic calques. 
Another (language internal) source of MWEs can be found in the inventory of 
already existing MWEs. In this paper we focus on one type of MWE formation: 
derivation from other MWEs, as shown in (1). We put together two topics that 
have rarely been discussed together in the same study. 

On the one hand, MWEs have been classified and characterized according to 
syntactic and morphological variability (Nunberg et al. 1994; Sag et al. 2002; Bald- 
win et al. 2003; Baldwin & Kim 2010, among others) and/or semantic decom- 
posability (Nunberg et al. 1994; Baldwin et al. 2003, among others), as well as 
according to types of idiomaticity (Baldwin 2004; 2006, among others). From a 
morphological perspective, only inflection and the reflexive form of verbs were 
discussed for each type of MWE (Sag et al. 2002; Savary 2008). 

On the other hand, derivation is a process defined as involving words 
(Marouzeau 1933): it is the process of creating new words out of existing ones, 
by means of attaching or detaching affixes to or from a stem respectively, the lat- 
ter type being better known as back-formation. An example of derivation is the 
word survival, created by attaching the suffix -al to the stem survive. An example 
of back-formation is the verb to back-form, obtained from back-formation by re- 
moving the suffix -ation. However, derivation can act both on words and MWEs. 


¹As convention of writing:

(i) We adopt the use of the international two letter code of the country in which the
language is spoken in front of each example to mark the language to which it belongs: RO
for Romanian, BG for Bulgarian.

(ii) We show the base MWEs on the left and the derived MWEs on the right.

In the former case, it always results in a new word; in the latter, it creates either
a new word or a new MWE, as we will show below. 

In the literature dedicated to either of the two topics (derivation, MWEs), one 
can identify two predominant trends: on the one hand, the discussion about 
derivation has always implied that words are the output and only rarely MWEs; 
on the other hand, the discussion about MWEs has implied, from time to time, 
reference to derivation: this interest has also been expressed, although sporadi- 
cally, in studies on phraseology, particularly in analyzing the behavior of idioms 
with respect to their derivational morphology. 

In this chapter, we describe the way derivation affects MWEs, providing ex- 
amples from Romanian and Bulgarian, languages which belong to different lan- 
guage families (Romance and Slavic, respectively) but have had a long history of 
contact. We focus on MWEs derived from other MWEs, highlighting morpholog- 
ical, syntactic and semantic modifications triggered by these transformations. 

In both Romanian and Bulgarian, derivation is much more productive than
compounding or other internal means of enriching the vocabulary. Moreover,
progressive derivation is more frequent than back-formation. In both languages
suffixation is the prevalent derivational means. Prefixation in Romanian is much
less productive. Bulgarian has highly developed deverbal verb formation, since
verbal prefixes express Aktionsart and the language has a rich Aktionsart system.
No cases of prefixation involving cross-part-of-speech derivation were found in
our data. Derivation affects all content word classes, simple or compound words.


2 Types of lexemes derived from MWEs 


When subject to derivation, MWEs can serve as bases for the creation of either 
other MWEs or words. We discuss these types in the subsections below. In our 
discussion, we will use the term BASE MWE to denote the MWE that serves as 
the input to the derivation process. 


2.1 MWEs derived from MWEs 


We offer here some examples of MWEs derived from MWEs, in both Romanian 
and Bulgarian: 


(2) a mustra cugetul (pe cineva) - mustrare de cuget (RO)
    to chide the.conscience (on someone) - chiding by conscience
    ‘to have remorse’ - ‘having remorse’

(3) съвестта гризе (някого) - гризене на съвестта (BG)
    sávestta grize (nyakogo) - grizene na sávestta
    the.conscience gnaws (someone) - gnawing of the.conscience
    ‘to have remorse’ - ‘having remorse’

(4) cronică literară - cronicar literar (RO)
    ‘literary review’ - ‘literary reviewer’

(5) моден дизайн - моден дизайнер (BG)
    moden dizayn - moden dizayner
    ‘fashion design’ - ‘fashion designer’


These examples show that different types of MWEs can feed derivations: id- 
ioms in (2) and (3), terms in (4) and compounds in (5). 

One content word (usually the syntactic head) of the base MWE is subject to
affixation: e.g. in (2) above mustrare (noun) is derived from the verb a mustra
with the suffix -re; in (4) cronicar is derived from cronică (the head of the base
MWE noun phrase) with the agentive suffix -ar. Likewise, in (3) the process of
derivation is carried out by means of suffixation of the verb гриза (griza) ‘gnaw’
with the suffix -не, thus obtaining the deverbal noun гризене (grizene) ‘gnawing’,
and in (5) the head noun дизайнер (dizayner) ‘designer’ is obtained from the noun
дизайн (dizayn) ‘design’ by means of the agentive suffix -ер.


2.2 Words derived from MWEs 


When one content word of the MWE is subject to affixation and the derived word 
has the semantic content of the base MWE, we regard this as words being derived 
from MWEs; the other words of the base MWE simply do not occur in the result 
of the derivation: 


(6) a face un lucru mușama - a mușamaliza (RO)
    to make a thing oilcloth
    ‘to cover something up’

(7) извадя (бизнес, ...) на светло - изсветля (бизнес, ...) (BG)
    izvadya (biznes, ...) na svetlo - izsvetlya (biznes, ...)
    bring.v (business, ...) to light - make.brighter.v (business, ...)
    ‘to legalize (business, ...)’

(8) a face la rotisor - rotisa (RO)
    ‘to cook in a rotisserie’

(9) instalator de gaze - gazist (RO)
    ‘gas installer’

(10) въздух под налягане - въздухар (BG)
     vázduh pod nalyagane - vázduhar
     ‘air under pressure’ - ‘an unreliable or incompetent person (especially
     one who pretends otherwise)’


This type of derivation involves semantic condensation as one of the content
words of the MWE, the one that carries most of the semantic load, takes up the
meaning of the whole. The word may be adapted morphologically to express
the relevant part of speech, for example by means of suffixation with a verbal
suffix, e.g. RO -iza (6), where the noun mușama ‘oilcloth’ yields the derivative
verb mușamaliza, by means of back-formation, e.g. (8) where the verb rotisa is
created from rotisor, or parasynthetically, e.g. BG из-, -я (7), where the
nominalized adjective светло (svetlo) ‘light’ gives the verb из-светл-я (iz-svetl-ya)
‘make brighter’. In addition, noun suffixes, such as the agentive suffixes RO -ist
(9) and BG -ар (10), express the semantic role of the derived noun.

These types of derivation seem to affect collocations (8), terms (9) and idioms 
(6), (7), (10) alike. In Romanian linguistics, the phenomenon has been described 
as very frequent and systematic (Groza 2011). However, no quantitative support 
has been offered for these claims and, as a consequence, we will not adhere to 
this estimation. In the Bulgarian literature, the specialists have remarked that 
while dephraseologization (and semantic condensation) is a productive process 
in the contemporary language, word-formation processes, including derivation, 
are relatively rare (Blagoeva 2011). We will not investigate this phenomenon here. 


3 Data selection and processing 


In order to study the behavior of MWEs with respect to derivation, we worked
with an inventory of MWEs extracted from large Romanian and Bulgarian
dictionaries containing MWEs.

For Romanian, this inventory was created starting from DELS (Dictionary of
Expressions, Idioms and Collocations) (Mărănduc 2010). The dictionary was
automatically parsed, MWEs were extracted and those marked as archaic were
eliminated, along with expressions, as they are unproductive with respect to
derivation; for the remaining 11,158 MWEs (collocations and idioms) we looked for
derivationally related MWEs by searching the web and manually inspecting the
results. Only for about 500 MWEs could we find derivationally related MWEs. So,
a first remark is the relatively low impact that derivation has on MWEs, at least
judging from the Romanian data. This may be a reason why the two phenomena
have rarely been discussed together.

The Romanian MWEs were preprocessed and annotated morphosyntactically:
they were automatically tokenized, lemmatized, tagged for part-of-speech (PoS)
and chunked using the TTL web service (Ion 2007). Each word form in the MWE
was identified, lemmatized and a PoS tag containing information about its part of
speech and morphosyntactic characteristics (number, gender, case, etc., depending
on the PoS) was attached to it. Syntactic groups were identified and marked
as such; they are called chunks and are useful for the analysis in §6.

The Bulgarian data were excerpted from a large electronic dictionary of MWEs
(Stoyanova & Todorova 2014). Named entities were removed since they are
unproductive with respect to the phenomena explored in this work. The remaining
MWEs were inspected and other unproductive types, such as proverbs, sayings
and other expressions, were also filtered out automatically, using as a filter the
code of the relevant type of MWE. Finally, obsolete and dialect entries were
manually removed. The resulting dictionary of 4,039 entries consists predominantly
of verb idioms and support verb constructions. The number of entries reflects
two facts: (i) many Bulgarian verbs form aspectual pairs, whose members are
distinct lexemes with their own inflectional and derivational morphology;
therefore, unless there are semantic restrictions to the contrary, two MWE entries
(one headed by a perfective aspect verb and one by an imperfective aspect verb)
were encoded in the dictionary; (ii) (to a lesser degree) prefixation is a regular
process which creates new verbs. Although the prefixed verb's meaning is modified
to a lesser or to a greater extent, a variant of the MWE headed by a prefixed
verb is often formed; thus the word family of a verb idiom may include a number
of derived verb idioms. In the dictionary we kept only the more frequent
MWEs derived through prefixation, basically those bearing resultative meaning,
e.g. (15). We found derivationally related MWEs for 2,612 entries in the dictionary,
with a great prevalence of deverbal MWEs. The data were additionally supplemented
with examples collected by the authors, adding up to 2,725 pairs.
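
To put the two inventories side by side, the share of dictionary entries for which a derivationally related MWE was found can be computed from the counts reported above. The snippet is only an illustrative calculation: the variable and function names are ours, and the Romanian figure uses the approximate count ("about 500") given in the text, so the result is a rough estimate rather than a reported statistic.

```python
# Counts reported in this section; the Romanian hit count is approximate ("about 500").
inventories = {
    "RO": {"entries": 11_158, "with_derived": 500},
    "BG": {"entries": 4_039, "with_derived": 2_612},
}


def coverage(entries: int, with_derived: int) -> float:
    """Share of dictionary entries that have a derivationally related MWE."""
    return with_derived / entries


rates = {lang: coverage(**counts) for lang, counts in inventories.items()}

for lang, rate in rates.items():
    print(f"{lang}: {rate:.1%}")
```

Under these figures the shares come out at roughly 4.5% for Romanian and 64.7% for Bulgarian. The two dictionaries were filtered differently, so the rates are indicative only and should not be read as a controlled cross-linguistic comparison.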

The MWEs were automatically tokenized, lemmatized and PoS-tagged using 
the Bulgarian Language Processing Chain (LPC) (Koeva & Genov 2011), which is 
available as a web service, and subsequently chunked using a stand-alone tool 
which uses the LPC output (Stoyanova et al. 2015). As a result, all the words 
in each MWE were marked with the relevant grammatical information, and the
basic syntactic structure of the MWEs (head, dependent syntactic groups) was 
identified and marked explicitly. 
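
The result of this preprocessing can be pictured as a small record per MWE: tokens with lemma and PoS, grouped into chunks, one of which is the head. The sketch below is a hypothetical representation for illustration only; the class names, field names and tag values are our assumptions and do not reproduce the output format of TTL or of the Bulgarian LPC.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str   # placeholder value, not actual lemmatizer output
    pos: str     # coarse PoS label, e.g. "V", "N", "P"


@dataclass
class Chunk:
    label: str            # syntactic group, e.g. "VP", "NP", "PP"
    tokens: list[Token]


@dataclass
class AnnotatedMWE:
    lang: str             # "RO" or "BG"
    chunks: list[Chunk]
    head_index: int = 0   # position of the head chunk in `chunks`

    @property
    def head(self) -> Chunk:
        return self.chunks[self.head_index]

    @property
    def dependents(self) -> list[Chunk]:
        return [c for i, c in enumerate(self.chunks) if i != self.head_index]


# The base MWE of example (15): pera pari 'to launder money'.
pera_pari = AnnotatedMWE(
    lang="BG",
    chunks=[
        Chunk("VP", [Token("pera", "pera", "V")]),
        Chunk("NP", [Token("pari", "pari", "N")]),
    ],
)
```

Keeping the head separate from the dependent chunks is what the analysis in §6 relies on: derivation changes the head, while the dependents are carried over and re-labelled.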


4 Derivation types in the domain of MWEs 


In this section we present the types of derivation detected in the domain of 
MWEs: progressive derivation by means of suffixes, prefixes or both, back- 
formation and zero-derivation. 


4.1 Progressive derivation 


The vast majority of derivation cases are progressive (i.e., MWEs are created
by adding affixes to a word in a previously existing MWE). In Bulgarian and
Romanian, these affixes can be suffixes, prefixes or both. Each subtype will be
discussed in the subsections below.


4.1.1 Suffixation 


In Romanian all 339 cases of progressive derivation are represented by
suffixation. In Bulgarian almost all of the 2,704 instances of progressive derivation
are accounted for by suffixation, with the exception of 10 cases of parasynthetic
derivation. The productivity of the suffixes in the two languages is represented
in Table 1.²

Other Romanian suffixes (-a, -ime, -iza) are much less productive in the set 
of pairs we dealt with, with only one or a maximum of two occurrences. In Bul- 
garian, other suffixes denoting events or results of events are instantiated with 
only a few examples in the data: -ex (three cases), —o6 (one case), -uya (one 
case). The noun suffix -ocm, which denotes properties, is found in three cases. 
Other agentive suffixes are -aum/—eum (two cases), -usa (two cases), —ux (one 
case). The suffixes -ypa, -uwe, —yua are found with institutions (one example 
per suffix). 

There are cases when the same MWE serves as a derivational base for two 
different MWEs. There are two ways in which this can be achieved. The first one 
is through separate derivational paths. The derivative MWEs in (11) and (12) are 
formed through independent derivational processes: 


² Abbreviations used in Table 1: Ag: Agent; Ev: Event; Instn: Institution; Instr: Instrument; L:
Language; Re: Result; SVs: Semantic Values; St: State.

Table 1: Suffixes. Their productivity and semantics.

L   Suffix        Productivity  SVs            Examples

RO  …             …             Ev             a-și băga mințile în cap, to insert one's minds into head,
                                               ‘to come to reason’ - băgare a minților în cap, inserting
                                               one's minds into head, ‘coming to reason’

RO  …             …             Ag             a face rele ‘to do bad things’ - făcător de rele ‘wrongdoer’

RO  …             …             St             sărac lipit ‘dog poor’ - sărăcie lipită ‘extreme poverty’
                                Ev             călători de plăcere ‘to travel for pleasure’ - călătorie de
                                               plăcere ‘travelling for pleasure’
                                Instn          judecător de pace ‘justice of the peace’ - judecătorie de
                                               pace ‘the court of a justice of the peace’

RO  …             …             Ev             a-ți arunca ochii ‘to cast a glance’ - aruncătură de ochi
                                               ‘glance’

BG  -ne           …             Ev, Re         pisha istoriya, to write history, ‘to make history’ - pisane
                                               na istoriya ‘making of history’

BG  …             …             Ev, Re         prodam na edro, to sell in bulk, ‘to wholesale’ - prodazhba
                                               na edro ‘a wholesale’

BG  -ach          …             Ag             svalyam zvezdi, to take down stars, ‘to promise the moon’ -
                                               svalyach na zvezdi ‘one who promises the moon’
                                Instr          hvashtam brimki ‘to mend ladders/stitches (e.g. in
                                               stockings)’ - hvashtach na brimki ‘a tool for mending
                                               ladders’

BG  -or/-er/-ir   …             Ag             komandvam parada, to command the parade, ‘to call the
                                               shots’ - komandir na parada ‘one who calls the shots’

BG  -tel          …             Ag             stroya vazdushni kuli ‘to build castles in the air’ -
                                               stroitel na vazdushni kuli ‘one who builds castles in the air’

BG  -ets          …             Ag             tárguvam na edro ‘to deal wholesale’ - tárgovets na edro
                                               ‘wholesaler’

(Of the productivity column, only the values 13 and 10 are legible; the rows they belong to
cannot be recovered.)
(11) a. a aduce laudă - aducere de laudă (RO)
        ‘to give praise’ - ‘giving praise’
     b. a aduce laudă - aducător de laudă (RO)
        ‘to give praise’ - ‘the one who gives praise’

(12) a. разбивам сърца - разбиване на сърца (BG)
        razbivam sártsa - razbivane na sártsa
        ‘to break hearts’ - ‘breaking of hearts’
     b. разбивам сърца - разбивач на сърца (BG)
        razbivam sártsa - razbivach na sártsa
        ‘to break hearts’ - ‘heartbreaker’


The verb MWEs in (11) and (12) undergo suffixation and yield either an eventive
noun (by means of the suffixes -re and -не, respectively) or an agentive one (by
the suffixes -tor and -ач, respectively) in the derivationally related MWEs.

The second way to form two or more MWEs from the same source follows
several steps along a single derivational path. We spotted six such instances in
the Romanian data and three in Bulgarian (Table 2). Typologically, the examples
are different: in Romanian, the noun-to-noun derivation yields antonyms. In
Bulgarian, the derived nouns lexicalize different semantic roles in the eventuality
denoted by the corresponding verb. Due to the small number of instances,
no conclusions can be reached for either of the languages. More examples from
these languages (as well as from others) would help to better understand possible
derivations.

Table 2: Multiple derivations.

Language  Pattern                Productivity  Example

RO        V-N-N                  3             a ști carte, to know book, ‘to be educated’
                                               știință de carte ‘education’
                                               neștiință de carte ‘lack of education’

RO        V-A-A                  3             a ști carte, to know book, ‘to be educated’
                                               (știutor de carte, ‘educated’)
                                               neștiutor de carte ‘uneducated’

BG        V-N_AGENT-N_LOCATION   3             pera pari ‘to launder money’
                                               perach na pari ‘money launderer’
                                               perachnitsa na pari ‘a business involved in
                                               money laundering’

Besides, as the verbs belonging to a given aspectual pair in Bulgarian are
characterized by their own derivational morphology and derivational patterns,
MWEs (just like single words) headed by different members of an aspectual pair
may serve as a base for derived MWEs with similar semantics, e.g. the imperfective
aspect verb gives rise to an eventive nominalization (13a), while the perfective
aspect counterpart yields a different deverbal MWE with an eventive (and
possibly resultative) interpretation (13b):

(13) a. побеждавам по точки - побеждаване по точки (BG)
        pobezhdavam po tochki - pobezhdavane po tochki
        ‘to outpoint, to outscore’ - ‘outpointing’
     b. победя по точки - победа по точки (BG)
        pobedya po tochki - pobeda po tochki
        ‘to outpoint, to outscore’ - ‘outpointing’

4.1.2 Parasynthetic derivation 


Another derivational device detected only in the Bulgarian data is parasynthetic 
derivation, when both a suffix and a prefix are attached to an existing word. All 
ten cases we found in the data represent derivations of verbs from adjectives: 


(14) гладен като вълк - огладнея като вълк (BG)
     gladen kato valk - ogladneya kato valk
     ‘as hungry as a wolf’ - ‘to become as hungry as a wolf’


4.1.3 Prefixation 


Prefixation alone rarely serves as a means for deriving new MWEs in Romanian
(see the examples of consecutive derivation in Table 2). In the Bulgarian data
MWEs resulting from verb to verb derivation (15), where prefixation is a productive
process, were included as separate entries in the dictionary and will not be
discussed further below:

(15) пера пари - из-пирам пари (BG)
     pera pari - iz-piram pari
     ‘to launder money’ - ‘to launder money up’ (resultative meaning)

We made this decision because the derivationally related verb MWEs have
different (although related) meanings and can themselves be subject to derivation,
e.g. пера пари (pera pari) ‘to launder money’ - пра-не на пари (pra-ne
na pari) ‘money laundering’, из-пирам пари (iz-piram pari) ‘to launder money
up’ - из-пира-не на пари (iz-pira-ne na pari) ‘money laundering’, resultative
meaning.


4.2 Back-formation 


We found only one case of back-formation in Romanian, in which the verb lucra 
is derived from the noun lucru (16), and six cases in Bulgarian (17), all of which 
are neologisms: 


(16) lucru de mână - a lucra de mână (RO)
     ‘handiwork’ - ‘to work by hand’

(17) промиване на мозъци - промивам мозъци (BG)
     promivane na mozátsi - promivam mozátsi
     ‘brainwash(ing)’ - ‘to brainwash’ (example from Blagoeva 2008)


These data reflect a tendency noted in works on Bulgarian terminology and 
neology (Baltova 1986; Kolkovska 1993/1994; Kostova 2013, among others) con- 
cerning the creation of eventive nouns, in particular nouns ending in -ne or 
ending in a verbal suffix followed by -ne that do not have a verb counterpart. 
The corresponding verbs are often formed by back-formation (17) and the newly 
created verbs or verb MWEs can be subject to further derivations: 


(18) промивам мозъци - промивач на мозъци (BG)
     promivam mozátsi - promivach na mozátsi
     ‘to brainwash’ - ‘brainwasher’


4.3 Zero-derivation (conversion) 


Fifteen cases in the Bulgarian data represent the process of conversion (also 
called zero-derivation) in which the derived MWE is formed without the attach- 
ment of a suffix and/or a prefix and usually involves detachment of a grammatical 
affix such as the inflection: 
(19) ударя под кръста - удар под кръста (BG)
     udarya pod krásta - udar pod krásta
     ‘to hit below the belt’ - ‘a hit below the belt’


With Romanian MWEs, conversion manifests itself in two ways: (i) the partici- 
ple form functions as an adjective with more than 150 verb MWEs; (ii) the supine 
form of several verb MWEs functions as a noun. The participle and the supine 
are homonymous non-finite verb forms. However, the discussion below will ex- 
clude such cases (and therefore zero-derivation) and will focus only on affixal 
derivation. 
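
As a quick consistency check, the Bulgarian counts reported for the individual derivation types in §4.1 to §4.3 can be summed and compared with the overall number of Bulgarian pairs given in §3. This is our own arithmetic on the published figures (the variable names are ours); the Romanian counts are left out because conversion cases are excluded from the Romanian discussion.

```python
# Bulgarian pair counts as reported in Sections 4.1-4.3.
bg_counts = {
    "progressive": 2_704,   # suffixation, including 10 parasynthetic cases
    "back_formation": 6,
    "conversion": 15,
}
bg_total_reported = 2_725   # number of pairs reported in Section 3

bg_total = sum(bg_counts.values())
print(bg_total, bg_total == bg_total_reported)

# Within progressive derivation, everything except the parasynthetic cases
# is plain suffixation.
bg_suffixation_only = bg_counts["progressive"] - 10
print(bg_suffixation_only)
```

The three types add up to the reported total of 2,725 pairs, which suggests that the classification in this section is exhaustive for the Bulgarian data.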


5 The morphological classes of the MWE heads involved 
in MWE derivation 


The formal description and analysis of the basic syntactic structure of MWEs and
their representation in the lexicon are important for the encoding and prediction
of some of the major morphological and syntactic properties of the MWEs, such
as: the components that are likely to inflect; the possibilities for modification
by optional elements (optional elements are placed in brackets), e.g. BG пера
(мръсни) пари (pera (mrásni) pari) ‘launder (dirty) money’; the possibility for
eliding modifiers with no change in meaning (placed in square brackets in this
example), e.g. вдигам летвата [високо] (vdigam letvata [visoko]) ‘raise the bar
[high]’; paradigmatic restrictions on agreement, on singular/plural forms, and
so forth. Among others, the syntactic analysis makes it possible to predict the
potential of MWEs for derivation and the structural changes that may take place
in this process (see §6).

The majority of the Romanian pairs extracted from the DELS involve verbs as 
bases for derivation. The most frequent type is represented by pairs of MWEs 
displaying verb nominalization, while derivative pairs involving other parts of 
speech are much rarer (see Table 3). 

For Bulgarian 2,725 derivative pairs were found. The difference in the number 
of pairs as compared with the initial set of 4,039 entries is due largely to the fact 
that the perfective aspect verbs in the set are very unproductive with respect to 
the derivational processes discussed. Deverbal noun formation accounts for the 
majority of cases (2,663), with much smaller numbers for the opposite noun to 
verb pattern, verb to adjective, adjective to verb, noun to noun, adjective to noun 
(see Table 3). 
Table 3: Morphological alternations occurring in MWE derivations.

Stem PoS - derived word PoS   # RO examples   # BG examples

V-N                           …               2,663
    RO: a depune jurământul ‘to take the oath’ - depunerea jurământului ‘taking the oath’
    BG: potrivam rátse ‘to rub (one's) hands’ - potrivane na rátse ‘rubbing of (one's) hands’

V-A                           …               …
    RO: a sări în ochi, to jump into eyes, ‘to be straightforward’ - săritor în ochi,
        jumping into eyes, ‘straightforward’
    BG: málcha kato pán, to keep silent like a log, ‘to be as mute as a maggot/fish’ -
        málchaliv kato pán, silent like a log, ‘(as) mute as a poker’

N-V                           …               …
    RO: semnal luminos ‘light signal’ - a semnaliza luminos ‘to signal with lights’
    BG: igra na nervi ‘a battle of nerves’ - igraya na nervi ‘to lead a battle of nerves’

N-N                           …               …
    RO: judecător de pace ‘justice of the peace’ - judecătorie de pace ‘the court of a
        justice of the peace’
    BG: voenen prokuror ‘military prosecutor’ - voenna prokuratura ‘military
        prosecutor's office’

A-N                           …               …
    RO: sărac lipit ‘dog poor’ - sărăcie lipită ‘extreme poverty’
    BG: nisht duhom ‘poor in spirit’ - duhovna nishteta ‘spiritual poverty’

A-V                           …               …
    BG: byal kato platno ‘as white as a sheet’ - pobeleya kato platno ‘to become as
        white as a sheet’

(Apart from the figure of 2,663 given in the text, only the count values 16, 18, 10 and 10
are legible; the cells they belong to cannot be recovered.)
6 Syntactic reorganizations resulting from derivations


Dependency Grammar is used as a syntactic framework for our discussion. In 
this framework, verbs admit subjects, complements and adjuncts, nouns (even 
those derived from verbs) admit modifiers and adjectives admit complements. 
Syntactic functions are understood as in Quirk et al. (1985). 

Out of the total number of 414 Romanian pairs, fifty do not undergo any inter- 
nal reorganization in the process of derivation; in Bulgarian this holds true for 
54 out of the 2,725 pairs: 


(20) agent de publicitate - agenție de publicitate (RO)
     ‘advertising agent’ - ‘advertising agency’

(21) военен прокурор - военна прокуратура (BG)
     voenen prokuror - voenna prokuratura
     ‘military prosecutor’ - ‘military prosecutor's office’

In (20), de publicitate receives the same syntactic analysis in both MWEs: it is a
modifier of the nouns agent and agenție, respectively. In (21) the adjectives военен
(voenen) ‘military’ and военна (voenna) ‘military’, which modify the head nouns
прокурор (prokuror) ‘prosecutor’ and прокуратура (prokuratura) ‘prosecutor's
office’, respectively, have the same analysis.

The cases without syntactic reorganization include the noun to noun, verb to 
adjective and adjective to verb patterns. In the following sections we will deal 
with the other two structural types of MWEs found in the data, that is: verb to 
noun and noun to verb MWEs. 

The syntactic structure of the base MWE determines whether the syntactic
expression of a dependent phrase is obligatory. For instance, a direct object NPdo
(do stands for direct object) that is not a fixed part of a base MWE but is licensed
by a transitive verb, as illustrated below, is not an obligatory dependent of the
derived MWE, while an internal argument that is a fixed part of the base MWE
is an obligatory component of the derived MWE. For example, in BG, пъхвам
(някого) зад решетките (páhvam (nyakogo) zad reshetkite) ‘put (someone) behind
bars’, the internal argument position (NPdo) is not a fixed part of the idiom;
rather, it is an open position that is filled by a suitable entity. In the nominalization
пъхване (на някого) зад решетките (páhvane (na nyakogo) zad reshetkite)
‘putting (of someone) behind bars’ the position corresponding to the direct object
may be left empty. On the contrary, if the NPdo is a fixed part of the MWE,
it cannot be omitted, e.g. кърша ръце (kársha rátse) ‘wring hands’ - кършене на
ръце (kárshene na rátse) ‘wringing of hands’. The syntactic structure of the base
MWE also determines the word order of obligatory and non-obligatory components
in the derived MWE (e.g. typically the object of the base MWE is closer to
the deverbal noun than other base MWE components).
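
The obligatoriness condition described above can be sketched as a small function: dependents that are a fixed part of the base MWE are always realized in the nominalization, whereas open slots may be left out. This is a minimal illustration under our own assumptions; the names `Dependent` and `render_nominalization` are hypothetical, and the derived forms (deverbal noun, inserted preposition) are supplied by hand rather than generated.

```python
from dataclasses import dataclass


@dataclass
class Dependent:
    text: str
    fixed: bool   # True if the dependent is a lexically fixed part of the base MWE


def render_nominalization(head_noun: str, dependents: list[Dependent],
                          keep_open_slots: bool = True) -> str:
    """Fixed dependents are always kept; open slots are optional (shown in brackets)."""
    parts = [head_noun]
    for dep in dependents:
        if dep.fixed:
            parts.append(dep.text)
        elif keep_open_slots:
            parts.append(f"({dep.text})")
    return " ".join(parts)


# 'putting (of someone) behind bars': the direct-object slot is open.
behind_bars = [Dependent("na nyakogo", fixed=False),
               Dependent("zad reshetkite", fixed=True)]
print(render_nominalization("páhvane", behind_bars))
print(render_nominalization("páhvane", behind_bars, keep_open_slots=False))

# 'wringing of hands': the former direct object is fixed and cannot be dropped.
print(render_nominalization("kárshene", [Dependent("na rátse", fixed=True)]))
```

The only information the function needs beyond the strings themselves is the `fixed` flag, which is exactly what the syntactic annotation of the base MWE is meant to provide.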

Next, we present the syntactic reorganizations observed in derived MWEs, as
we found them in the available data for Romanian and Bulgarian. Their
documentation facilitates text processing. Given the limited MWE dictionaries
available, all knowledge facilitating the automatic morphosyntactic analysis of text
is considered valuable. Below we offer rules that algorithms can use to process new
MWEs which are derived from existing ones.


6.1 Verb PP or AdvP complement or adjunct - noun modifier


This pattern is observed when a verb MWE is related with a noun MWE via
derivation (see (i.1) below) or the other way round (i.2). It accounts for 260
Romanian and 792 Bulgarian pairs:³


(i) (1) VP [V PP/AdvP] > NP [Nv-derived PP/AdvP]
    (2) NP [N PP/AdvP] > VP [Vn-derived PP/AdvP]

The verb admits a prepositional phrase (PP) or an adverbial phrase (AdvP)
functioning as a complement or an adjunct in the MWE, but it can also admit
other modifiers placed outside the MWE. Through derivation, the constituents,
except for the head word, preserve their syntactic category and internal structure,
as can be noticed in (i), where the form of the modifying phrase is the same.
Although semantically the dependent functions similarly in the NP and the VP,
its syntactic role is different according to our analysis: when the head is a verb,
we analyse the particular dependent as a complement or an adjunct and, when
the head is a noun, we analyse it as a modifier.

Below we indicate the syntactic category (PP, AdvP) and the status (comple- 
ment, adjunct, modifier) of the dependent phrases. Complement or adjunct status 
is determined with respect to the argument structure of the verb that heads the 
respective MWE. 
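
Pattern (i) can be read as a relabelling rule over the dependents of the base MWE: the syntactic category (PP, AdvP) is copied, and only the status changes with the category of the head. The sketch below illustrates this under our own assumptions; the function name and the tuple encoding are hypothetical. In the noun-to-verb direction the new status cannot be computed from the noun MWE alone, because it depends on the argument structure of the derived verb, so it is passed in explicitly.

```python
def derive_pattern_i(base_label, dependents, verb_status=None):
    """Apply pattern (i) to a list of (category, status) pairs.

    V -> N: every PP/AdvP keeps its category and becomes a modifier.
    N -> V: the category is kept; the new status must be supplied, since it
            depends on the argument structure of the derived verb.
    """
    allowed = {"PP", "AdvP"}
    if any(cat not in allowed for cat, _ in dependents):
        raise ValueError("pattern (i) only covers PP and AdvP dependents")
    if base_label == "VP":
        return "NP", [(cat, "modifier") for cat, _ in dependents]
    if base_label == "NP":
        if verb_status not in {"complement", "adjunct"}:
            raise ValueError("verb_status must be 'complement' or 'adjunct'")
        return "VP", [(cat, verb_status) for cat, _ in dependents]
    raise ValueError(f"unexpected base label: {base_label}")


# Example (22): the PP complement of the verb becomes a noun modifier.
print(derive_pattern_i("VP", [("PP", "complement")]))

# Example (24): the PP modifier of the noun becomes an adjunct of the derived verb.
print(derive_pattern_i("NP", [("PP", "modifier")], verb_status="adjunct"))
```

The function deliberately leaves the morphological derivation of the head word aside; it only captures the structural part of the rule.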

In the verb MWE in (22), de credință is the prepositional object (i.e., a
complement) of the reflexive verb a se lepăda, whereas in the noun MWE the same
PP is a modifier of the noun lepădare derived from a se lepăda with the suffix
-re. Likewise, in the BG verb MWE in (23) the Goal PP в джоба (v dzhoba)
‘in the pocket’ is the prepositional object of the verb бъркам (bárkam) ‘thrust
one's hand’, whereas, in the noun MWE, it functions as a modifier of the noun
бъркане (bárkane) ‘thrusting one's hand’, derived from the verb бъркам (bárkam)
by means of the suffix -не.

³ Patterns are enumerated in the text with Roman numerals.

(22) a se lepăda de credință - lepădare de credință (RO)
     ‘to depart from the faith’ - ‘departing from the faith’

(23) бъркам в джоба (на някого) - бъркане в джоба (BG)
     bárkam v dzhoba (na nyakogo) - bárkane v dzhoba
     thrust.one's.hand.v in the.pocket (of someone)
     ‘to incur expenses (on someone)’ - ‘incurring of expenses’


The PP în peniță (24) is a modifier in the first MWE and an adjunct of the verb
in the second MWE; the verb is derived from the noun in the former MWE.
In (25), the PP под кръста (pod krásta) ‘below the belt’ is an adjunct of the verb
in the first MWE and a modifier of the noun удар (udar) ‘hit’ in the second MWE;
the noun is derived from the verb ударя (udarya) ‘hit’.

(24) desen în peniță - a desena în peniță (RO)
     ‘pen drawing’ - ‘to draw in pen’

(25) ударя под кръста - удар под кръста (BG)
     udarya pod krásta - udar pod krásta
     ‘to hit below the belt’ - ‘a hit below the belt’


In (26) the adverb aminte is a complement of the verb in the former MWE and
a noun modifier in the latter. In (27) the adverb отвисоко (otvisoko) ‘from above’
is an adjunct of the verb гледам (gledam) ‘look’ in the former MWE and a noun
modifier in the latter.

(26) a lua aminte - luare aminte (RO)
     ‘to take into consideration’ - ‘taking into consideration’

(27) гледам отвисоко (някого / нещо) - гледане отвисоко (BG)
     gledam otvisoko (nyakogo / neshto) - gledane otvisoko
     ‘to look down on (someone / something)’ - ‘looking down on (someone /
     something)’


In both languages derivation from a noun MWE to a verb MWE is much rarer: 
(28) игра на нерви - играя на нерви (BG)
     igra na nervi - igraya na nervi
     a.play of nerves - play.v nerves
     ‘a battle of nerves’ - ‘to lead a battle of nerves’


6.2 Subject complement or object complement - noun modifier 


In the Bulgarian data we detected a small number of verb MWEs that have a
subject complement or an object complement (Quirk et al. 1985; Downing 2014)⁴
as part of their structure. Syntactically, these complements are expressed as NPs,
PPs or APs. We use the notation (cs) for subject complements and (co) for object
complements.

With this type of derivation, the verb MWE subject complement turns up as a
modifier in the derived noun MWE (12 cases altogether). The derivation may be
represented as in (ii).


(ii) VP [V NPcs/PPcs/APcs] > NP [Nv-derived NPcs/PPcs/APcs]


The derivation involves a copular verb, such as съм (sam) ‘be’, ставам (stavam)
‘become’, оставам (ostavam) ‘remain’, or a verb that is not a typical copula (e.g.
отивам (otivam) ‘go’) and admits a subject complement in the MWE. The
deverbal noun (Nv-derived) derived from this verb heads the noun MWE, and the
subject complement turns up as a post-modifier that preserves both its syntactic
category and the type of syntactic linking to the head word. The examples below
illustrate subject complements: PPcs (29), APcs (30), NPcs (31).


(29) ставам за смях - ставане за смях (BG)
     stavam za smyah - stavane za smyah
     to.become for ridicule - becoming for ridicule
     ‘to become a laughing stock’ - ‘becoming a laughing stock’

(30) ставам разноглед - ставане разноглед (BG)
     stavam raznogled - stavane raznogled
     to.become cross-eyed - becoming cross-eyed
     ‘to become confused or overwhelmed (by something)’

⁴ A subject or an object complement is a constituent that does not represent a new participant
but completes the predicate by adding information about the subject or the object referent,
respectively (Downing 2014), e.g. a separate notion in The country became a separate notion,
young in He died young (subject complement); a genius in People considered Picasso a genius
(object complement).

(31) отивам войник - отиване войник (BG)
     otivam voynik - otivane voynik
     go.v a.soldier - going a.soldier
     ‘to go into the army’ - ‘going into the army’


Derivations involving an object complement are exemplified with 44 cases in
the data. They typically apply to transitive verbs (but verbs admitting a PP-object
do occur, see (iv), (34) below). The direct object (NPdo) is licensed by the verb that
heads the MWE but it is not a fixed part of the MWE. In the formal representation
this NPdo is enclosed in curly brackets {}. As we are particularly concerned
with the way the structure of the MWE is reorganized, we do not consider the
expression of the MWE-external NPdo if it occurs, although it obeys the rules
applying to any direct object:


(iii) VP [V {NPdo} NPco/PPco/APco] > NP [Nv-derived {PP [P NPdo]} NPco/PPco/APco]


Here are examples of an MWE headed by a transitive verb with different
realizations of the object complement: an APco (32) and a PPco (33):


(32) дера (някого) жив - дране жив (на някого) (BG)
     dera (nyakogo) zhiv - drane zhiv (na nyakogo)
     skin.v (someone) alive - skinning alive (of someone)
     ‘to cause great trouble (to someone)’

(33) правя (някого / нещо) на решето - правене на решето (на някого / нещо) (BG)
     pravya (nyakogo / neshto) na resheto - pravene na resheto (na nyakogo / neshto)
     make.v (someone / something) a riddle - making a riddle (of someone / something)
     ‘to make a lot of holes in someone/something, to riddle someone/something’

The lack of preposition insertion is a structural difference between the derivations
involving subject/object complement NPs and direct object NPs, since the
latter normally turn up as prepositional modifiers of the corresponding deverbal
nouns. We leave aside the marginal cases of direct objects (not introduced
by a preposition) that occasionally co-occur with the canonical form in formal
administrative language.
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The derivation involving an MWE headed by a verb that admits a prepositional 
object (PPo) has the following representation (iv): 


(iv) VP [V PPo APco/NPco/PPco] > NP [Nv-derivea PPo APco/NPco/PPco] 


(34) exemplifies an MWE with a verb admitting a PP-object. The prepositional 
object retains its syntactic expression when it turns up as an NP modifier. 


(34) казвам на черното бяло - казване на черното бяло (BG)
kazvam na chernoto byalo - kazvane na chernoto byalo
call.v to the.black white - calling to the.black white

‘to call black white’ - ‘an instance of calling black white’


6.3 Subject or direct object - noun modifier 


This particular derivation pattern concerns fixed subject verb MWEs or fixed
direct object verb MWEs. Unlike the cases discussed in §6.1 and §6.2, in this
category nominalization triggers either insertion of a preposition (in both
languages) that introduces the former subject or direct object as a prepositional
noun modifier (see §6.3.1), or mapping of the former subject or direct object into
a genitive modifier (only for Romanian) (see §6.3.2). We will reserve the term
‘genitive modifier’ for modifiers whose head noun is marked with the genitive case.


6.3.1 Subject or direct object - prepositional modifier 


A subject (v) or a direct object (vi) in a verb MWE turns up as a prepositional 
modifier of the corresponding deverbal noun that heads the corresponding noun 
MWE. In (v1) and (vi1) a noun MWE is derived from a verb MWE, while in (v2)
and (vi2) a verb MWE is derived from a noun MWE.


(v) (1) NPs V > NP [Nv-derived PP [P NPs]]
(2) NP [N PP [P NPs]] > NPs Vn-derived


(vi) (1) VP [V NPdo] > NP [Nv-derived PP [P NPdo]]
(2) NP [N PP [P NPdo]] > VP [Vn-derived NPdo]


There are eight pairs in Romanian and seventy-five in Bulgarian involving the
subject, and thirty-two pairs in Romanian and 1,732 in Bulgarian involving the
direct object.

In Romanian the preposition de is always used and Bulgarian usually adds the
preposition na. Both prepositions can be glossed in English with of. In Bulgarian,
other prepositions may occur, generally when the noun is derived with a suffix
other than -не (-ne) or -ние (-nie) (the prevalent suffixes for eventive and/or
resultative deverbal nouns) or by other derivational means, e.g. zero-derivation:
обичам (obicham) ‘to love’ - обич (obich) ‘love’.
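
The direct-object pattern (vi1) with the default prepositions just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `VerbMWE`, `DEFAULT_PREPOSITION` and `nominalize` are hypothetical, the deverbal head is supplied by hand (morphology is not modelled), and the Bulgarian forms are given in transliteration.

```python
from dataclasses import dataclass

# Default prepositions named in the text: "de" for Romanian, "na" for Bulgarian.
DEFAULT_PREPOSITION = {"RO": "de", "BG": "na"}


@dataclass(frozen=True)
class VerbMWE:
    """A verb MWE with a fixed direct object (hypothetical toy representation)."""
    lang: str             # "RO" or "BG"
    verb: str             # head verb (kept for reference only)
    direct_object: str    # fixed direct object (NPdo)
    rest: tuple = ()      # other fixed constituents, kept unchanged


def nominalize(mwe, derived_head, preposition=None):
    """Apply VP [V NPdo] > NP [Nv-derived PP [P NPdo]].

    The derived head is passed in by the caller; a non-default preposition
    (possible in Bulgarian, according to the text) can be given explicitly.
    """
    prep = preposition if preposition is not None else DEFAULT_PREPOSITION[mwe.lang]
    return " ".join([derived_head, prep, mwe.direct_object, *mwe.rest])


print(nominalize(VerbMWE("RO", "ști", "carte"), "știutor"))
print(nominalize(VerbMWE("BG", "raztyagam", "lokumi"), "raztegach"))
print(nominalize(VerbMWE("BG", "tsepya", "stotinkata", ("na dve",)), "tsepene"))
```

The three calls reproduce the derived forms of examples (37), (38) and (40) in this section; the genitive pattern of §6.3.2 is deliberately left out, since it would require case morphology.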

We repeat (2) and (3) as (35) and (36), respectively, in order to exemplify a case
where the subject of a verb MWE corresponds to a prepositional modifier of the
deverbal noun that heads the derived noun MWE. The subjects of the verb MWEs
(cugetul and съвестта (săvestta)) correspond to the prepositional modifiers of
the deverbal nouns mustrare and гризене (grizene), which are derived from the
verbs mustra and гриза (griza) respectively.


(35) a mustra cugetul (pe cineva) - mustrare de cuget (RO)
to chide consciousness (OBJMARKER somebody) - chiding by consciousness

‘to have remorse’ - ‘having remorse’


(36) съвестта гризе (някого) - гризене на съвестта (BG)
săvestta grize (nyakogo) - grizene na săvestta
the.consciousness gnaws (someone) - gnawing of the.consciousness

‘to have remorse’ - ‘having remorse’


(37) and (38) exemplify the case where the direct object of the verb MWE
corresponds to a prepositional modifier of the deverbal word that heads the
derived MWE. The direct objects carte and локуми (lokumi) correspond to the
prepositional modifiers of the (negative) adjective (ne)știutor and the noun
разтегач (raztegach) respectively, which have been derived from the verbs ști
and разтягам (raztyagam) respectively.


(37) a ști carte - (ne)știutor de carte (RO)
to know book - (not)knowing of book

‘to be educated’ - ‘(un)educated’


(38) a. разтягам локуми (BG)
raztyagam lokumi
‘to spin yarn, to tell tales’

b. разтегач на локуми (BG)
raztegach na lokumi
spinner of yarn

‘yarnspinner’


As noted in §6.2, when the direct object is not a fixed part of the MWE, it may 
be left unexpressed. This is not the case with direct objects that are fixed parts of 
base MWEs: they are neither left out nor replaced with a possessive pronoun. 

Apart from a direct object, the MWEs in this and in other categories may have 
other constituents, e.g. (an)other complement(s), such as a prepositional object or 
an adjunct. These constituents preserve their syntactic category and the syntactic 
link to the head word, but assume a different syntactic status (similarly to what 
was presented in §6.1).


(39) a despica firul în patru - despicarea firului în patru (RO)
to split the.hair in four - splitting of.a.hair in four

‘to make small and overly fine distinctions’


(40) цепя стотинката на две - цепене на стотинката на две (BG)
tsepya stotinkata na dve - tsepene na stotinkata na dve
split.v the.penny in half - splitting of the.penny in half

‘to be very stingy’


In (39) and (40), în patru ‘in four’ and на две (na dve) ‘in half’ are PPs
functioning as adjuncts in the VP and as modifiers in the derived NPs.

In the Bulgarian data we found MWEs headed by verbs that take a direct ob- 
ject and an object complement that are both part of the MWE (three pairs). The 
construction has the following form (vii): 


(vii) VP [V NPdo APco/NPco/PPco] > NP [Nv-derived PP [P NPdo] APco/NPco/PPco]


The syntactic status of the MWE constituents in the derived structure is
predictable: in the noun MWE, the NPdo constituent of the base verb MWE
corresponds to a modifier introduced by a preposition (na), and the object
complement phrase of the base MWE turns up as a modifier of the derived MWE that
is expressed in the same way: as an AP in (41), развързани (razvărzani) ‘untied’;
as an NP in (42), бяло (byalo) ‘white’; and as a PP in (43), с истинските им имена
(s istinskite im imena) ‘by their proper names’:

(41) оставям ръцете (на някого) развързани - оставяне на ръцете (на някого) развързани (BG)
ostavyam rătsete (na nyakogo) razvărzani - ostavyane na rătsete (na nyakogo) razvărzani
leave.v the.hands (of someone) untied - leaving of the.hands (of someone) untied

‘to untie someone's hands’ - ‘untying someone's hands’


(42) наричам черното бяло - наричане на черното бяло (BG)
naricham chernoto byalo - narichane na chernoto byalo
call.v the.black white - calling of the.black white

‘to call black white’ - ‘an instance of calling black white’


(43) наричам нещата с истинските им имена - наричане на нещата с истинските им имена (BG)
naricham neshtata s istinskite im imena - narichane na neshtata s istinskite im imena

‘call things by their proper names’ - ‘calling things by their proper names’


6.3.2 Subject or direct object — genitive modifier 


In this case the subject or the direct object of the verb MWE corresponds to a 
genitive modifier in the derived MWE. The reorganization may be represented 
as (viii) and (ix) for the subject and the object, respectively: 


(viii) NPs V > NP [Nv-derived NPs-genitive]
(ix) VP [V NPdo] > NP [Nv-derived NPdo-genitive]

We encountered 12 MWEs described by (viii) in our data. As shown in (44),
the subject întunericul corresponds to the genitive modifier of the noun lăsarea
derived from the verb lăsa:

(44) se lasă întunericul - lăsarea întunericului (RO)
REFL lower the.darkness - lowering of.the.darkness

‘it is getting dark’ - ‘the fact of getting dark’

The expression in (42) is synonymous with the one in (34). The structural difference is due to the
different syntactic properties of the synonymous verbs казвам (kazvam) ‘call’ and наричам
(naricham) ‘call’: казвам takes a PP-object in the respective sense, while наричам takes an NP
object.


Forty-six pairs display the type of derivation where the direct object of the 
verb MWE corresponds to a genitive modifier in the noun MWE (45): 


(45) a băga zâzanie - băgarea zâzaniei (RO)
to insert dissension - insertion of.dissension

‘to sow dissent’ - ‘the sowing of dissent’

The direct object zâzanie corresponds to a genitive modifier of the noun
băgarea, derived from the verb a băga.


6.4 Adjunct - adjectival modifier 


In this case, the adjunct (either a prepositional or an adverb phrase) modifying 
the verbal head of an MWE corresponds to an adjectival phrase in a noun MWE 
derived from a verb MWE (see (x1) below) or vice versa - a verb MWE is derived 
from a noun MWE and the adjective modifier in the noun MWE corresponds to 
an adjunct (either a prepositional or an adverb phrase) in the verb MWE (see 
(x2) below). This structure, represented in (x), was detected in six Romanian and 
sixteen Bulgarian pairs. 


(x) (1) VP [V PP/Adv(P)] > NP [Nv-derived, A(P)]
(2) NP [Nv-derived, A(P)] > VP [V PP/Adv(P)]


In Romanian the modifying adjective usually occurs after the modified noun. 
The normal position of the modifier in Bulgarian is to the left of the modified 
noun. The comma in (x) is used for signalling the possibility of having the modi- 
fier and the modified noun in either order with respect to each other. 

Here are some examples of this type of syntactic reorganization: 


(46) arest preventiv — a aresta preventiv (RO) 
‘preventive detention’ - ‘to subject to preventive detention’ 


(47) честна игра - играя честно (BG)
chestna igra - igraya chestno

‘a fair play’ - ‘to play fairly’

In (46) preventiv is an adjective modifying the noun arest in the first MWE
and an adverb modifying the verb aresta, derived from arest. Likewise, in the BG
example (47), честна (chestna) ‘fair’ is the adjectival modifier of the noun игра
(igra) ‘play’ and честно (chestno) ‘fairly’ is the adverb modifier of the verb играя
(igraya) ‘play’.

An example of a derivation of a noun MWE with an adjective modifier (трезва
(trezva) ‘straight’) from a verb MWE with an adverb (трезво (trezvo) ‘straight’)
is shown in (48):

(48) мисля трезво - трезва мисъл (BG)
mislya trezvo - trezva misăl

‘to think straight’ - ‘straight thinking’


Examples (46), (47) and (48) involve an adverb modifier in the verb MWE. The
other type of construction presented in (x) (involving a PP adjunct) is exemplified
by (49):

(49) търгувам на едро - едра търговия (BG)
tărguvam na edro - edra tărgoviya

‘to deal wholesale’ - ‘wholesaling’


The PP modifier на едро (na edro) ‘big/in bulk’ of the verb MWE търгувам на
едро (tărguvam na edro) ‘to deal wholesale’ corresponds to the adjective modifier
едра (edra) ‘big’ of the noun MWE едра търговия (edra tărgoviya) ‘wholesaling’.
Note that variants are possible where the PP adjunct of the verb MWE becomes
a PP post-modifier in the noun MWE; these cases fall under §6.1.

Table 4 sums up the data presented in §6, along with their share (in percentage)
in the overall number of cases that undergo syntactic reorganization (364 for
Romanian and 2,671 for Bulgarian).

Several conclusions can be drawn. The MWEs in which the fixed base MWE
subject corresponds to a fixed PP modifier in the derived MWE or vice versa
(§6.3.1) have a very similar share in the two languages and, as the numbers show,
the construction is relatively rare. The same holds for verb MWE adjuncts that
turn up as adjectival modifiers or vice versa (§6.4). The cases involving subject
complements or object complements (§6.2) are found only in Bulgarian. Still, this
pattern is potentially productive as the head verbs involved in it are very common
(e.g. make/do). In Romanian, the correspondence between a subject or an object
and a genitive modifier (§6.3.2) is more common than the correspondence to a
PP modifier (§6.3.1).

Table 4: Distribution of Romanian and Bulgarian MWEs across types.

Type                              No. of RO   No. of BG   RO data %   BG data %
                                  examples    examples
§6.1 PP/AdvP                      260         792         71.2%       29.65%
§6.2 Subject/object complement    0           56          0%          2.1%
§6.3.1 Subject                    8           75          2.5%        2.81%
§6.3.1 Object                     32          1,732       8.8%        64.84%
§6.3.2 Subject                    12          0           3.3%        0%
§6.3.2 Object                     46          0           12.63%      0%
§6.4 Adjunct                      6           16          1.65%       0.6%
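
The percentage columns of Table 4 are each type's count divided by the language total (364 for Romanian, 2,671 for Bulgarian). The short Python sketch below recomputes them from the counts in the table; the names `COUNTS` and `shares` are illustrative only, and a recomputed value may differ slightly from the printed one in a few rows.

```python
# Counts per type as (Romanian, Bulgarian), copied from Table 4.
COUNTS = {
    "§6.1 PP/AdvP": (260, 792),
    "§6.2 Subject/object complement": (0, 56),
    "§6.3.1 Subject": (8, 75),
    "§6.3.1 Object": (32, 1732),
    "§6.3.2 Subject": (12, 0),
    "§6.3.2 Object": (46, 0),
    "§6.4 Adjunct": (6, 16),
}


def shares(counts):
    """Return the language totals and each type's share (in %) per language."""
    ro_total = sum(ro for ro, _ in counts.values())
    bg_total = sum(bg for _, bg in counts.values())
    table = {
        label: (100 * ro / ro_total, 100 * bg / bg_total)
        for label, (ro, bg) in counts.items()
    }
    return ro_total, bg_total, table


ro_total, bg_total, table = shares(COUNTS)
print(ro_total, bg_total)
for label, (ro_pct, bg_pct) in table.items():
    print(f"{label:32s} RO {ro_pct:6.2f}%   BG {bg_pct:6.2f}%")
```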


There is a striking difference with respect to the prevalent base MWE structure
in each of the languages. In Romanian the most frequent construction is
verb-prepositional object/adjunct (§6.1), while the verb-direct object construction
is quite uncommon (§6.3.1). In Bulgarian the most frequent type is verb-direct
object (§6.3.1); verb-prepositional object/adjunct (§6.1) is also typical, although
there are more than twice as many verb-direct object constructions. This points to
a significant difference in the syntactic expression of complements (as reflected
in the structure of MWEs): PP objects are by far the preferred choice in Romanian,
while in Bulgarian both direct objects and PP objects are common, with a
marked preference for the former.


7 Semantics of the derivational patterns 


In this section we present the semantic aspects of the MWEs that involve deriva- 
tion. Although we refer to the semantics of the base MWE, we are more interested 
in the semantics of the derived MWEs. Tables 5 and 6 offer an overview of the 
derived MWE semantics. 

In Romanian, the great majority of the base MWEs (349) designate events. This 
remark correlates with the data in Table 4, where most base MWEs are verbs. 
Furthermore, the derived nominalizations (mostly with the suffix -re) also denote 
events (322 cases): this correlates with the number of V-N pairs in Table 3. 


For verb-noun pairs we used the inventory of morpho-semantic relations from PWN (Fellbaum
et al. 2009), but we added some roles to it whenever they proved necessary.

Table 5: Semantics of the base and the derived MWEs. Frequencies and
examples.

Language  Base MWE  Derived MWE  Occurrences  Examples
RO        Event     Event        322          a cădea în păcat ‘to fall into sin’
                                              cădere în păcat ‘falling into sin’
BG        Event     Event        2,590        promivam mozătsi ‘to brainwash’
                                              promivane na mozătsi ‘brainwash’
RO        Event     Agent        18           a vâna zestre ‘to hunt dowry’
                                              vânător de zestre, hunter of dowry, ‘fortune hunter’
BG        Event     Agent        53           promivam mozătsi ‘to brainwash’
                                              promivach na mozătsi ‘one who does brainwashing’
RO        State     State        1            a mustra cugetul (pe cineva) ‘to have remorse’
                                              mustrare de cuget ‘having remorse’
BG        State     State        12           zhiveya po tsarski ‘to live regally’
                                              zhivot po tsarski ‘a regal life’
RO        Event     Instrument   3            a arunca flăcări ‘to throw flames’
                                              aruncător de flăcări, thrower of flames, ‘flamethrower’
BG        Event     Instrument   3            razbărkvam karti ‘to shuffle cards’
                                              razbărkvach na karti ‘card shuffler’
RO        State     Experiencer  3            a voi binele ‘to wish well’
                                              voitor de bine, wisher of well, ‘well-wisher’
RO        Event     Distance     3            a arunca cu bățul ‘to throw with a stick’
                                              aruncătură de băț ‘as far as the stick can be thrown’
BG        Event     Institution  5            kova zakoni, forge laws, ‘to create and promulgate laws’
                                              kovachnitsa na zakoni, smithy of laws, ‘the parliament’


8 Derivation in the domain of multiword expressions 


Table 6: Semantics of the base and the derived MWEs. Frequencies and
examples.

Language  Base MWE          Derived MWE     Occurrences  Examples
RO        Job               Institution     8            judecător de pace ‘justice of the peace’
                                                         judecătorie de pace ‘the court of a justice of the peace’
BG        Job               Institution     3            voenen prokuror ‘military prosecutor’
                                                         voenna prokuratura ‘military prosecutor's office’
RO        Event             Vehicle         1            a vâna submarine ‘to hunt for submarines’
                                                         vânător de submarine, hunter of submarines,
                                                         ‘a vessel for locating and attacking submarines’
RO        Result            Action          1            lucru de mână ‘handiwork’
                                                         a lucra de mână ‘to work by hand’
RO        Artefact          Event           2            desen în peniță ‘pen drawing’
                                                         a desena în peniță ‘to draw in pen’
RO        Event             Characteristic  2            a sări în ochi, to jump into eyes, ‘to be straightforward’
                                                         săritor în ochi, jumping into eyes, ‘straightforward’
BG        Event             Characteristic  8            rabotya kato vol, work like an ox/horse, ‘to work hard’
                                                         rabotliv kato vol, as hard-working as an ox,
                                                         ‘very hard-working’
BG        State             Characteristic  8            mălcha kato păn, keep silent like a log,
                                                         ‘to be as mute as a maggot/fish’
                                                         mălchaliv kato păn, silent like a log,
                                                         ‘(as) mute as a maggot/fish’
BG        Inchoative state  Characteristic  10           gladen kato vălk ‘as hungry as a wolf/bear’
                                                         ogladneya kato vălk ‘to become as hungry as a wolf/bear’
BG        Job               Agent           18           softuerno inzhenerstvo ‘software engineering’
                                                         softueren inzhener ‘software engineer’

The Bulgarian data also support the productivity and regularity of the derivation
of eventive nominalizations (2,590), predominantly with the suffix -ne. Another
interesting tendency (though represented by few examples), especially with
respect to neologisms, is the back-formation of verbs from nouns. The other
semantic types encountered with derived MWEs constitute a small share of the
overall data. Derivatives such as agents, experiencers, instruments and locations
are derived primarily from VP [V NPdo] MWEs, and less frequently from
VP [V PP/AdvP]. No examples were found for such nouns derived from MWEs
with the following syntactic structure: NPs V or VP [V NPcs/PPcs/APcs].

The productivity of event nominalization is not unexpected, because in the
process of MWE-to-MWE derivation the majority of cases involve idiomatic
(partial) predicate-argument structures. As the structure of eventive
nominalizations may reflect the argument structure of the base verb (Grimshaw 1990),
it readily renders these idiomatic structures. The frequency of use of eventive
nominalizations, whether single words or MWEs, is explained by the fact that
they make it possible to refer to an action/event regardless of its doer and the
time of occurrence (as expressed by verbal categories) (Pometkova 2006). Hence
they may be used interchangeably with the verb-headed construction, or may even
be preferred contextually in certain cases or in certain registers, such as
scientific discourse.

In Romanian, the number of nominalizations increases greatly (by almost 200
cases in our data) when taking into account supine forms of verbs which, via
conversion, become nouns. In these cases the rest of the MWE behaves similarly
to the cases displaying affixal derivation, i.e. almost the same types of syntactic
changes occur. Moreover, supine nouns are (for more than 150 cases in our data)
alternatives to derived nominalizations, with a semantic difference: Cornilescu
(2001) maintains that -re nominalizations tend to express results, while supines
express events.

As the data show, other types of verb-noun derivational patterns, such as the
ones resulting in agents, experiencers, instruments, locations and so forth, are
significantly fewer in number. In our opinion, the semantic reason for this
phenomenon is that the situations described by the respective verb MWEs frequently
do not conceptualize a particular type of agent, experiencer, instrument and
so forth that needs to be lexicalized. Moreover, in terms of their semantic and
syntactic properties, these types of nouns do not inherit and express the base
verb arguments and/or adjuncts as readily. This is supported by the fact that when
the need arises to express the relevant agentive, instrumental, etc. meaning,
participle-headed constructions are preferred, at least in Bulgarian (50).

(50) вземам решение - вземащ решение (BG)
vzemam reshenie - vzemasht reshenie

‘to make a decision’ - ‘(the one) making a decision’


These participial constructions may be either contextually used or may un- 
dergo nominalization and lexicalization. 

Another frequent word-formation device is compounding, in which case the
arguments/adjuncts are incorporated in the word structure. Here are some examples
of one-word compounds that have MWE counterparts in the data: сърцеразбивач
(sărtserazbivach) ‘heartbreaker’, миторазбивач (mitorazbivach) ‘mythbuster’,
кодоразбивач (kodorazbivach) ‘codebreaker’, монетосекач (monetosekach)
‘coiner/minter’, etc.


8 Conclusions 


Putting MWEs and derivation together, we notice that derivation affects MWEs, 
creating either words or other MWEs. The productivity of this phenomenon 
seems to depend on language characteristics: Bulgarian, a language with aspect, 
allows for more cases of derivation than Romanian, which lacks aspect. Another 
factor influencing productivity is the data set: Romanian DELS lacks terms, which 
do occur in the Bulgarian dictionary and are productive in terms of derivation, 
serving the need for expressing different actors, instruments, objects, places, etc. 
within a domain of activity. 

We have presented data from Bulgarian and Romanian. However, derivation 
has been reported to act upon MWEs in other languages: Piela (2007) discusses 
examples of words created from idioms and argues that this process is productive 
in Polish; in Russian, the process of creating MWEs from MWEs seems to be the 
most productive internal means of MWE formation (Ermakova et al. 2015). We 
can conclude that MWEs are subject to derivation in more languages and com- 
paring and contrasting them from such a perspective can be of linguistic interest. 


9 Acknowledgements 


Most of the work reported here has been carried out within the project
PARSing and Multiword Expressions (PARSEME), IC1207 COST Action. Another
part has been carried out within the joint project "Enhanced Knowledge Bases
for Bulgarian and Romanian" of the Institute for Bulgarian Language, Bulgarian
Academy of Sciences, and the Research Institute for Artificial Intelligence,
Romanian Academy.

We would like to thank Ivelina Stoyanova and Maria Todorova for providing
the Bulgarian MWE dictionary in electronic form, Ivelina Stoyanova for the
automatic processing of the Bulgarian data, Cătălina Mărănduc for kindly providing
us with the electronic version of DELS and Cătălin Mititelu for the automatic
processing of the Romanian data.

Last but not least, we are grateful to the anonymous reviewers of the paper 
and to the editors for their comments on the previous versions of the paper and 
for their suggestions that helped us to improve its quality. We also thank Judith 
Elver for being kind enough to proofread this paper on very short notice. 


Abbreviations

AG     Agent
BG     Bulgarian
LPC    Bulgarian Language Processing Chain
DELS   Dictionary of Expressions, Idioms and Collocations
EV     Event
INSTN  Institution
INSTR  Instrument
L      Language
POS    part of speech
RE     Result
RO     Romanian
SVS    Semantic Values
ST     State
V      Verb (in the glosses)


References 


Baldwin, Timothy. 2004. Multiword expressions. Advanced course. Australasian 
Language Technology Summer School (ALTSS 2004). Sydney, Australia. 

Baldwin, Timothy. 2006. Compositionality and multiword expressions: Six of one, 
half a dozen of the other. Invited talk given at the COLING/ACL'06 Workshop 
on Multiword Expressions: Identifying & Exploiting Underlying Properties. 

Baldwin, Timothy, Colin Bannard, Takaaki Tanaka & Dominic Widdows. 2003. 
An empirical model of multiword expression decomposability. In Proceedings 
of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and 
Treatment, vol. 18, 89-96. Sapporo. 

Baldwin, Timothy & Su Nam Kim. 2010. Multiword expressions. In Nitin In- 
durkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 
2nd edn., 267-292. Boca Raton: CRC Press. 




Baltova, Yulia. 1986. Za nyakoi yavleniya i tendentsii v izgrazhdaneto na
leksikalnata sistema na savremenniya balgarski knizhoven ezik. In Vaprosi na
savremennata balgarska leksikografiya i leksikologiya, 74-80. Sofia: BAS Publishing
House.

Blagoeva, Diana. 2008. Novi frazeologichni kalki v balgarskiya ezik (v sapostavka
s drugi slavyanski ezitsi). In Izsledvaniya po frazeologiya, leksikologiya i
leksikografiya, 149-153. Sofia: Prof. M. Drinov Publishing House.

Blagoeva, Diana. 2011. Defrazeologizatsiyata kato iztochnik na leksikalni i seman- 
tichni inovatsii v savremenniya balgarski ezik. In Ezikovedski izsledvaniya v 
chest na Prof. Siyka Spasova-Mihaylova, 139-151. Sofia: Prof. M. Drinov Pub- 
lishing House. 

Cornilescu, Alexandra. 2001. Romanian nominalizations: Case and aspectual 
structure. Journal of Linguistics 37(3). 467-501. 

Downing, Angela. 2014. English grammar: A university course. London & New 
York: Routledge. 

Ermakova, Elena Nicolayevna, Natalia Nikolaevna Zolnikova, Guzel Chakhva- 
rovna Faizullina, Milyausha Sakhretdinovna Khasanova & Tatiana Nikolaevna 
Khlyzova. 2015. Derivation and the derivational space in Phraseology as a prob- 
lem of contemporary language development. Mediterranean Journal of Social 
Sciences 6. 335-340. 

Fellbaum, Christiane, Anne Osherson & Peter E. Clark. 2009. Putting semantics
into WordNet's “morphosemantic” links. In Zygmunt Vetulani & Hans Uszkoreit
(eds.), Human Language Technology. Challenges of the Information Society:
Third Language and Technology Conference, LTC 2007. Revised selected papers,
350-358. Poznań, Poland.

Grimshaw, Jane. 1990. Argument structure. Cambridge, MA: MIT Press. 

Groza, Liviu. 2011. Probleme de frazeologie: studii, articole, note. Editura
Universității din București.

Ion, Radu. 2007. Word Sense Disambiguation Methods Applied to English and Ro- 
manian. Romanian Academy, Bucharest dissertation. 

Koeva, Svetla & Angel Genov. 2011. Bulgarian language processing chain. In Pro- 
ceedings of Integration of Multilingual Resources and Tools in Web Applications. 
Workshop in conjunction with GSCL, vol. 26. 

Kolkovska, Siya. 1993/1994. Slovoobrazuvane na imena za deystviya ot otimenni 
glagoli (s ogled na terminologiyata). Balgarski ezik 4. 478-480. 

Kostova, Nadya. 2013. Novite imena za deystviya v balgarskiya ezik i tyahnoto
leksikografsko predstavyane. In Diana Blagoeva, Sia Borisova Kolkovska &
Margarita Lishkova (eds.), Problemi na neologiyata v slavyanskite ezitsi, 77-108.
Sofia: Prof. M. Drinov Publishing House.

Mărănduc, Cătălina. 2010. Dicționar de expresii, locuțiuni și sintagme ale limbii
române. București: Corint.

Marouzeau, Jules. 1933. Lexique de la terminologie linguistique. Paris: Librairie
orientaliste Paul Geuthner.

Nunberg, Geoffrey, Ivan A. Sag & Thomas Wasow. 1994. Idioms. Language 70(3). 
491-538. 

Piela, Agnieszka. 2007. Od Frazeologizmu do derywatu. LingVaria II 1(3). 41-48. 

Pometkova, Yana. 2006. Spetsifikatsiya na konstruktsiite s otglagolno sasht- 
estvitelno ime vav funktsiyata na podlog v nauchnata rech. In Sbornik Yor- 
danka Marinova. Izsledvaniya po sluchay neyniya sedemdesetgodishen yubiley, 
276-284. V. Tarnovo: Sv. sv. Kiril i Metodiy University Press. 

Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech & Jan Svartvik. 1985. A 
comprehensive grammar of the English language. Longman. 

Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake & Dan Flickinger. 
2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 
3rd International Conference on Intelligent Text Processing and Computational 
Linguistics (CICLing-2002), 1-15. 

Savary, Agata. 2008. Computational Inflection of MultiWord Units. A contrastive 
study of lexical approaches. Linguistic Issues in Language Technology 1(2). 1-53. 

Stoyanova, Ivelina, Svetlozara Leseva & Svetla Koeva. 2015. Partial syntactic 
analysis of Bulgarian. Paisievi chetenia, Plovdiv, 30-31 October 2015. 

Stoyanova, Ivelina & Maria Todorova. 2014. Razrabotvane na rechnitsi ot sastavni 
edinitsi za balgarski. In Svetla Koeva & Diana Blagoeva (eds.), Ezikovi resursi i 
tehnologii za balgarski ezik, 185-201. Sofia: Prof. M. Drinov Publishing House. 




Chapter 9 


Modelling multiword expressions in a 
parallel Bulgarian-English newsmedia 
corpus 


Petya Osenova 
Linguistic Modelling Department, IICT-BAS 


Kiril Simov 
Linguistic Modelling Department, IICT-BAS 


The paper focuses on the modelling of multiword expressions (MWE) in Bulgarian- 
English parallel news corpora (SETimes; CSLI dataset and PennTreebank dataset). 
Observations were made on alignments in which at least one multiword expression 
was used per language. The multiword expressions were classified with respect to 
the PARSEME lexicon-based (WG1) and treebank-based (WG4) classifications. The 
non-MWE counterparts of MWEs are also considered. Our approach is data-driven
because the data of this study was retrieved from parallel corpora and not from 
bilingual dictionaries. The survey shows that the predominant translation relation 
between Bulgarian and English is MWE-to-word, and that this relation does not 
exclude other translation options. To formalize our observations, a catenae-based 
modelling of the parallel pairs is proposed. 


1 Introduction 


This work proposes a catenae-based modelling of aligned pairs in parallel 
Bulgarian-English news corpora. A representation is suggested that handles bilin- 
gual pairs comprising at least one MWE. Our main aim is to offer a representation 
that deals equally well with cross-language symmetries and asymmetries. 

In each language, MWEs were annotated independently from the alignments
in the corpus. Then, using the alignments, we examined how MWEs were translated
between the two languages. The following general alignment types of examples
are considered: MWE-to-MWE; MWE-to-word; MWE-to-phrases. This general
typology is not exhaustive since, in most cases, another translation option
could have been used. Thus, it is interesting to observe the lexical
choices actually made in the parallel data.
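
The three alignment types can be pictured as a simple decision over the two sides of an aligned pair. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used in this study: the names `Side` and `alignment_type` are hypothetical, each side is assumed to carry its tokens and a flag saying whether it was annotated as an MWE, and the token strings are placeholders rather than corpus data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Side:
    """One side of an aligned pair (hypothetical toy representation)."""
    tokens: tuple   # the aligned tokens on this side
    is_mwe: bool    # True if this side was annotated as an MWE


def alignment_type(a, b):
    """Classify an aligned pair that contains at least one MWE."""
    if not (a.is_mwe or b.is_mwe):
        raise ValueError("at least one side must be an MWE")
    if a.is_mwe and b.is_mwe:
        return "MWE-to-MWE"
    other = b if a.is_mwe else a          # the non-MWE counterpart
    return "MWE-to-word" if len(other.tokens) == 1 else "MWE-to-phrases"


mwe = Side(("w1", "w2"), True)
print(alignment_type(mwe, Side(("v1", "v2", "v3"), True)))
print(alignment_type(mwe, Side(("v1",), False)))
print(alignment_type(Side(("v1", "v2"), False), mwe))
```

The function is symmetric in its arguments, so the MWE may sit on either the Bulgarian or the English side of the pair.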

In our work we refer to the classifications of MWEs developed within 
PARSEME (PARSing and Multiword Expressions)¹ in Working Groups 1 and 4
- WG1: Lexicon-Grammar Interface and WG4: Annotating MWEs in Treebanks. 
The first one focuses on the linguistic properties of MWEs (structure, reflexes 
to alternations such as passivisation, etc.) and is more detailed, while the sec- 
ond one is treebank-related and thus focuses on a different set of MWE features 
such as the structural correspondences among MWEs across languages and the 
distributions observed in corpora. 

The results from the empirical study highlight at least the following issues: (1) 
realization options of different MWE types in two languages with different mor- 
phological complexity and word order; (2) a data-driven typology of alignment 
possibilities among various types of MWEs; (3) modelling the bilingual data with 
a catenae-based approach. 

The paper is structured as follows: §2 outlines the related work; §3 introduces
catenae in a more formal way and also describes the main operations that can
be applied on them; §4 presents bilingual catenae; §5 describes the parallel data
and its classification; and §6 concludes the paper.


2 Related work 


This section comprises two parts: a discussion on MWE classification and a pre- 
sentation of catenae. Concerning the former, there is extensive literature regard- 
ing the study of MWEs within a language and across languages, theoretical issues 
on MWE modelling, etc. Here only some of them will be mentioned. To the best 
of our knowledge, this is the first attempt to use catenae for modelling bilingual 
or multilingual MWE correspondences. 


2.1 MWE classifications 


There is no widely accepted classification of MWEs (Villavicencio & Kordoni
2012). For the task of automatic recognition of MWEs in Bulgarian, Stoyanova
(2010) adopts the classification of Baldwin et al. (2003). This classification
can be characterized as semantically oriented, since the MWEs are classified
as non-decomposable by meaning, idiosyncratically decomposable, and simple
decomposable.

¹PARSEME is an interdisciplinary scientific network devoted to the role of
multiword expressions in parsing (IC1207 COST Action).

In Sag et al. (2002) another classification is proposed. The MWEs are divided
into lexicalized phrases and institutionalized phrases. Here we do not consider
institutionalized phrases (semantically and syntactically compositional, but
statistically idiosyncratic) as a distinct group. Lexicalized phrases are
further subdivided into fixed expressions, semi-fixed expressions, and
syntactically flexible expressions. Fixed expressions are said to be fully
lexicalized and to undergo neither morphosyntactic variation nor internal
modification. Semi-fixed expressions have a fixed word order, but “undergo some
degree of lexical variation, e.g. in the form of inflection, variation in
reflexive form, and determiner selection” (Sag et al. 2002: 4); they include
non-decomposable idioms and proper names. Syntactically flexible expressions
allow for some variation in their word order (light verb constructions,
decomposable idioms).

On the multilinguality front, there are various approaches to different MWE- 
related problems. For example, in Rácz et al. (2014) the multilingual annotation 
of light verb constructions is discussed for English, Spanish, German and Hun- 
garian. The specific annotation properties of these elements are described for 
each language. Another popular task is the construction of bi- or multilingual 
MWE lexicons on the basis of parallel or comparable corpora. In Seo et al. (2014)
a context-oriented method is proposed for French and Korean. 

The WG4 classification was specially tailored to reflect the typology of MWEs 
in syntactically annotated corpora (treebanks). It divides MWEs into the follow- 
ing groups on the basis of the parts-of-speech (PoS) of the head word: 


1. Nominal MWEs 

2. Verbal MWEs 

3. Prepositional MWEs 

4. Adjectival MWEs 

5. MWEs of other categories


6. Proverbs 


Some of these groups are further subdivided into subtypes: nominal MWEs
include named entities (NEs), nominal compounds, and other nominal MWEs;
verbal MWEs include phrasal verbs, light verb constructions, VP idioms, and
other verb MWEs. Thus, the WG4 classification is syntax-based.

The WG1 classification elaborates the typology by studying idiomaticity and
flexibility on the basis of a large set of morphosyntactic diagnostics. With respect to
flexibility, the WG1 approach differs from Sag et al. (2002) in providing a coarser 
division between semi-flexible and flexible MWEs. With respect to idiomaticity, 
the classification is based on Baldwin & Kim (2010). It handles five types: lexical, 
syntactic, semantic, pragmatic and statistical idiomaticity. Our work deals with 
the syntactic and semantic idiomaticity in a bilingual context. 


2.2 Catena 


The notion of catena "chain" was introduced in O'Grady (1998: 284) as a mech- 
anism for representing the syntactic structure of idioms. He shows that for this 
task there is need for a definition of syntactic patterns not coinciding with con- 
stituents. A variant of this definition was offered by Osborne (2006): 


The words A, B, and C (order irrelevant) form a chain if and only if A 
immediately dominates B and C, or if and only if A immediately dominates 
B and B immediately dominates C. (Osborne 2006: 258) 


In recent years the notion of catena has been revived and applied to
dependency representations. Catenae have been used successfully for the modelling of
problematic language phenomena. Gross (2010) presents the morphological and 
syntactic problems that have led to the introduction of the subconstituent catena 
level. Constituency-based analysis has to deal with non-constituent structures in 
ellipsis, idioms, and verb complexes. 

Apart from the linguistic modelling of language phenomena, catenae have 
been used in a number of NLP applications. Maxwell et al. (2013), for example, 
present an approach to Information Retrieval based on catenae. The authors con- 
sider the catena as a mechanism for semantic encoding which overcomes the 
problems of long-distance paths and elliptical sentences. Also, Sanguinetti et al. 
(2014) present a catena-related approach for syntactic alignments in multilingual 
treebanks. In translation research, catenae are best known as "treelets" (Quirk & 
Menezes 2006). We employ catenae, which have already been used in NLP appli- 
cations, to model the interface between the treebank and the lexicon. 

A first attempt to formalise MWE information with catenae is discussed in 
Simov & Osenova (2015). In the next section we present the main notions of our 
proposal. 


3 Definition of catena. Operations on catenae


We follow the definition of catena provided by O'Grady (1998) and Gross (2010): a 
CATENA is a word or a combination of words directly connected with dominance
relations. In fact, in the domain of dependency trees, this definition is equivalent 
to a subtree definition. Figure 1 shows a complete dependency tree and some of 
its catenae. Notice that the complete tree is also a catena. Individual words are 
catenae, too. With “rootc” we mark the root of the catena that might be identical 
with the root of the complete tree, but it also might be different as in the case of 
John and an apple in Figure 1. 
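
The definition can be illustrated with a minimal Python sketch that enumerates
the catenae of a small dependency tree as connected sets of words. The head
assignments are our assumed reading of Figure 1 (bought as root; John, and, ate
and apple attached to bought; an attached to apple), and the function names are
hypothetical. The sketch relies on the fact that a non-empty set of tree nodes
is connected by dominance exactly when one and only one of its members has its
head outside the set; that member is the node marked "rootc".

```python
from itertools import combinations

# Dependency tree for "John bought and ate an apple": dependent -> head.
# These attachments are an assumed reading of Figure 1.
HEADS = {
    "John": "bought",
    "bought": None,   # root of the complete tree
    "and": "bought",
    "ate": "bought",
    "an": "apple",
    "apple": "bought",
}


def catena_roots(words, heads):
    """Return the members of `words` whose head lies outside `words`."""
    members = set(words)
    return [w for w in members if heads[w] not in members]


def is_catena(words, heads):
    """A non-empty word set is a catena iff it has exactly one such root."""
    return len(catena_roots(words, heads)) == 1


def all_catenae(heads):
    """Enumerate every catena of the tree (exponential; small trees only)."""
    nodes = sorted(heads)
    return [
        set(subset)
        for size in range(1, len(nodes) + 1)
        for subset in combinations(nodes, size)
        if is_catena(subset, heads)
    ]


if __name__ == "__main__":
    print(is_catena({"an", "apple"}, HEADS))        # connected by dominance
    print(is_catena({"John", "apple"}, HEADS))      # bought is missing
    print(catena_roots({"an", "apple"}, HEADS))     # the rootc of the catena
    print(len(all_catenae(HEADS)))
```

Under these assumptions, an apple is a catena whose root (apple) differs from
the root of the complete tree, while John ... apple without bought is not a
catena.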


Figure 1: A complete dependency tree and some of its catenae. [Graphic not
reproduced: the dependency tree for John bought and ate an apple, with arcs
labelled root, subj, conj, cc, dobj and det, and three of its catenae, each
with its root marked rootc.]


A catena as an object on its own is a tree in which the nodes are decorated with
various labels including word forms, lemmas, parts-of-speech, and grammatical
features; the arcs are augmented with dependency labels. The labeling
function is partial. Thus, some nodes or arcs remain non-decorated in the catena 
and allow for different mappings to dependency trees. When the catenae are 
not mapped on dependency trees, they are considered part of the lexicon or the 
grammar of a given language. 

We call the mapping of a catena onto a given dependency tree the realization of
the catena in the tree. We consider the realization of the catena as a fully specified 
subtree including all the nodes and arc labels. Each realization of a catena has to 
agree with its labeling outside of the dependency tree. For example, the catena 
for (to) spill the beans will allow for any realization of the verb form like in: they 
spilled the beans and he spills the beans. Thus, the catena in the lexicon will be 
underspecified with respect to the grammatical features and word forms for the 
corresponding lexical items. 
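
The relation between an underspecified catena and its realizations can be
sketched as follows. This is a simplified illustration, not the implementation
used in the paper: the dictionary layout, the PoS tags and the function names
are assumptions, and the mapping between catena nodes and tree nodes is taken
as given (node identifiers are shared). A label set to None is unspecified and
matches any value; a realization must be fully specified and must agree with
every label that the catena fixes.

```python
import copy

# Lexicon catena for "(to) spill the beans"; None marks an unspecified label.
LEXICON_CATENA = {
    "nodes": {
        1: {"lemma": "spill", "form": None, "pos": "V"},
        2: {"lemma": "the", "form": "the", "pos": "D"},
        3: {"lemma": "bean", "form": "beans", "pos": "N"},
    },
    "arcs": {(1, 3): "dobj", (3, 2): "det"},  # (head, dependent) -> label
}


def is_fully_specified(tree):
    """True if no node label and no arc label is left unspecified."""
    labels = [v for node in tree["nodes"].values() for v in node.values()]
    labels += list(tree["arcs"].values())
    return all(v is not None for v in labels)


def realizes(candidate, catena):
    """True if `candidate` is a fully specified subtree compatible with `catena`."""
    if not is_fully_specified(candidate):
        return False
    if candidate["nodes"].keys() != catena["nodes"].keys():
        return False
    if candidate["arcs"].keys() != catena["arcs"].keys():
        return False
    for node_id, fixed in catena["nodes"].items():
        for label, value in fixed.items():
            if value is not None and candidate["nodes"][node_id].get(label) != value:
                return False
    for arc, label in catena["arcs"].items():
        if label is not None and candidate["arcs"][arc] != label:
            return False
    return True


def instantiate(catena, verb_form, noun_form="beans"):
    """Build a candidate subtree by filling in concrete word forms."""
    tree = copy.deepcopy(catena)
    tree["nodes"][1]["form"] = verb_form
    tree["nodes"][3]["form"] = noun_form
    return tree


if __name__ == "__main__":
    print(realizes(instantiate(LEXICON_CATENA, "spilled"), LEXICON_CATENA))
    print(realizes(instantiate(LEXICON_CATENA, "spills"), LEXICON_CATENA))
    print(realizes(instantiate(LEXICON_CATENA, "spilled", "bean"), LEXICON_CATENA))
```

Both they spilled the beans and he spills the beans pass the check because the
verb form is left open in the lexicon, whereas a singular bean conflicts with
the fixed word form of the noun.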
Figure 2: Catena realization. [Graphic not reproduced: a lexical catena and
two of its realizations (Realization 1 and Realization 2) in dependency trees,
with rows for extended PoS, word forms and lemmas.]


In this paper, the underspecified catena is called a lexicon catena (LC) and
it is stored in the lexical entries. Figure 2 shows a lexical catena for the idiom
затваря-м си оч-и-те (zatvarya-m si ochite) close-PRS.1SG REFL eye-PL-DEF
‘shut one's eyes’,² and two of its realizations. Catenae in the lexicon do not
specify any particular word order.³ The word order of the catena realization
reflects the rules of the grammar; therefore, the realisation of the same
catena in different dependency trees could materialise with different word
orders.

²Examples contain the Bulgarian string in Cyrillic, its Latin transcription in
brackets, and the gloss. A literal translation may follow as English text,
while translations are always enclosed in inverted commas (‘ ’).

³Formalisation of the word order within the catena remains an open question for
future work.

The upper part of the image in Figure 2 represents the lexicon catena for the 
idiom. It determines the fixed elements of the catena: arcs and their labels as well 
as nodes and their labels. More precisely, the following information is included: 
extended part of speech (PoS),⁴ word forms, and lemmas.⁵ The translations of
the word form are presented, too. A dash (-) under a node indicates that the 
corresponding element is not defined for the given node. In Figure 2, the dash 
represents the fact that the word form for the verb node is underspecified, there- 
fore the idiom can be marked with a variety of tense, person and other values. 

In the two realizations, the fixed elements of the catena are represented as in 
the lexicon catena. Thus, the lemmas are the same as the word forms, the parts- 
of-speech and the grammatical features for the direct object and for the clitic 
are also the same. The realizations are different from the lexicon catenae with 
respect to the word forms and the grammatical features of the verb node: in both 
examples the verb is in past tense while in the first realization it is in plural and 
in the second in singular number. The word order in the two realizations is dif- 
ferent. Thus, the underspecified catenae representation allows for various levels 
of morphosyntactic and semantic flexibility within the multiword expressions. 

The catena representation of the lexical items explicitly denotes their prop- 
erties that constrain their interaction. We proceed to show how we model the 
selectional restriction of a given lexical unit with respect to a catena in a sen- 
tence. The main operation for modelling the interactions among the catenae is 
called COMPOSITION. For example, let us assume that the verb to read requires 
that its subject denotes a human and that its object denotes an information ob- 
ject. In Figure 3 we present how the catena for I read is combined with the catena 
a book in order to form the catena I read a book. The figure represents the level of 
word forms and the level of semantics (specified only for the node, on which the 
composition is performed). The catena for I read ... specifies that the unknown 
direct object has the semantics of an Information Object (InfObj). The catena for 
a book represents the fact that the book is an Information Object. Thus the two 
catenae are composed on the two nodes marked as InfObj. The result is
represented in the lower part of Figure 3. We have defined the composition
operation for catenae that agree with each other on one node; the operation
can also be defined for more than one agreeing node.

⁴The extended parts of speech are defined as prefixes of the tags in the
BulTreeBank tagset: http://www.bultreebank.org/TechRep/BTB-TR03.pdf

⁵In some examples we give the important information only; thus, some of these
rows are missing. In some examples new rows are used to introduce additional
information.

Figure 3: Composition of catenae. [Graphic not reproduced: the catena for
I read ..., the catena for a book, and the composed catena I read a book with
arcs labelled root, subj and dobj.]
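
The composition of I read ... with a book can be sketched in Python as follows.
This is a minimal illustration under stated assumptions: the dictionary
encoding, the node names and the compose function are hypothetical, semantics
is reduced to a single type label per node, and composition is restricted to
one agreeing node, as in the case described above.

```python
# A catena is encoded as a root, labelled nodes, and (head, dependent) arcs.
I_READ = {
    "root": "read",
    "nodes": {
        "read": {"form": "read", "sem": None},
        "I": {"form": "I", "sem": "Human"},
        "obj": {"form": None, "sem": "InfObj"},  # unknown direct object
    },
    "arcs": {("read", "I"): "subj", ("read", "obj"): "dobj"},
}

A_BOOK = {
    "root": "book",
    "nodes": {
        "book": {"form": "book", "sem": "InfObj"},
        "a": {"form": "a", "sem": None},
    },
    "arcs": {("book", "a"): "det"},
}


def compose(host, slot, guest):
    """Identify host node `slot` with the root of `guest` if their semantics agree."""
    slot_node = host["nodes"][slot]
    guest_root = guest["nodes"][guest["root"]]
    if slot_node["sem"] is None or slot_node["sem"] != guest_root["sem"]:
        raise ValueError("the two catenae do not agree on the composition node")

    nodes = {name: dict(node) for name, node in host["nodes"].items()}
    # The shared node keeps the agreed semantics and takes the guest's form.
    nodes[slot] = {"form": guest_root["form"], "sem": slot_node["sem"]}

    # Rename the remaining guest nodes so that they cannot clash with host nodes.
    rename = {guest["root"]: slot}
    for name, node in guest["nodes"].items():
        if name != guest["root"]:
            rename[name] = f"{slot}.{name}"
            nodes[rename[name]] = dict(node)

    arcs = dict(host["arcs"])
    for (head, dep), label in guest["arcs"].items():
        arcs[(rename[head], rename[dep])] = label
    return {"root": host["root"], "nodes": nodes, "arcs": arcs}


if __name__ == "__main__":
    result = compose(I_READ, "obj", A_BOOK)
    print(sorted(n["form"] for n in result["nodes"].values()))
    print(result["arcs"])
```

A guest catena whose root is not an Information Object is rejected, which is
how the selectional restriction of the verb is enforced in this sketch.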

In Figure 4 the structure of the lexical entry for the verb бяга-м (byaga-m)
run-PRS.1SG ‘run’ is presented in the sense ‘run away from facts’. The verb
selects an indirect object in the form of a prepositional phrase introduced
with the preposition от (ot) ‘from’. In Figure 5 we give the catena for the
synonymous MWE затварям си очите (zatvaryam si ochite) close.PRS.1SG REFL
eye.PL ‘I close my eyes’.

The lexical entry of a MWE uses the format: a lexicon-catena, semantics and
valency.⁶ Lexicon-catenae for the MWEs are stored in their canonical form. The
semantics part of a lexical entry is represented with a logical formula comprising
elementary predicates. The role of possible modifiers has to be specified in the
lexicon-catena if modification of the MWE is possible, for instance when
structures with modifiers of the noun can be attested in the data. For example,
the MWE затварям си очите (zatvaryam si ochite) close.PRS.1SG REFL eye.PL.DEF,
which is synonymous to the verb бягам (byagam) run.PRS.1SG, is presented in
Figure 5.⁷ The valency level is built as follows: the root of the valency catena is
marked with the identifier of the node in the lexical catena for which the
particular valency representation is applicable. In Figure 5 the valency
representation is applicable to the root node CNo1 of the lexical catena. The
two catenae are composed on this node. The composition is applied to the
semantics of the lexical catena and of the valency catena. Note that the nodes
No1 and No2 are different from the nodes CNo1 and CNo2.


⁶The corresponding fields in the lexical entry (rows in the tables below) are
marked as: LC, SM, Fr (for valency frames).

⁷The grammatical features are: ‘poss’ for possessive pronoun, ‘plur’ for plural
number and ‘def’ for definite noun.
Figure 4: Lexical entry for the verb run. [Graphic not reproduced. Recoverable
content: LC with a single verb node; SM: CNo1:{ run-away-from(e,x0,x1),
fact(x1), [1](x1) }; Fr: one prepositional indirect object with
SM: No2:{ fact(x), [1](x) }.]

Figure 5: Lexical entry for I close my eyes. [Graphic not reproduced.
Recoverable content: LC for the MWE; SM: CNo1:{ run-away-from(e,x0,x1),
fact(x1), [1](x1) }; Fr: two alternative prepositional indirect objects, each
with SM: No2:{ fact(x), [1](x) }.]

We use catenae to represent both single words and MWEs because single
words are also catenae by definition.

We can specify all the grammatical features of a lexical item using the
formal definition of catena given above. The semantics defined in the lexical
entry can be attached to each node in the lexicon-catena. In Figure 4 there is
just one node of the lexicon-catena. In this paper, we present only the set of
elementary predicates rather than providing their full semantic structures⁸
because we focus on the principles of the representation. In Figure 4 the verb
introduces three elementary predicates: run-away-from(e, x0, x1), fact(x1),
[1](x1). The predicate run-away-from(e, x0, x1) represents the event and its
main participants: x0, x1. The predicate fact(x1) is part of the meaning of
the verb in the sense that the agent represented by x0 will run away from some
(unpleasant) situation. The underspecified predicate [1](x1) has to be
compatible with the predicate fact(x1). This predicate is used for
incorporating the meaning of the indirect object at something in the frame
shut one's eyes at something. The valency frame is given as a set of valency
elements, each defined as a catena with a semantic description. The catena
describes the basic structure of the valency element including the necessary
lexical information, grammatical features, and the syntactic relation to the
main lexical item. The semantic description determines the main semantic
contribution of the frame element and is incorporated in the semantics of the
whole lexical item with structural sharing. In Figure 4 there is only one
frame element. It is introduced with the preposition от (ot) ‘from’. The
semantics originates in the dependent noun, which has to be compatible with
the predicate fact(x), and in the underspecified predicate [1](x), which may
introduce a specific predicate. Via the structure sharing index [1], this
specific predicate is copied into the semantics of the main lexical item.
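
The three-part lexical entry (LC, SM, Fr) and the effect of the structure
sharing index can be sketched as below. The class and function names are
hypothetical, predicates are simplified to (name, arguments) pairs, and the
specific predicate truth is an invented example of what a realized indirect
object might contribute; none of this is taken from the actual lexicon.

```python
from dataclasses import dataclass, field


@dataclass
class FrameElement:
    preposition: str   # word introducing the indirect object
    semantics: list    # elementary predicates; "[1]" is a shared index


@dataclass
class LexicalEntry:
    lexicon_catena: list                         # LC: lemmas of the fixed elements
    semantics: list                              # SM: elementary predicates
    frames: list = field(default_factory=list)   # Fr: valency frame elements


def fill_shared_index(entry, index, predicate):
    """Return SM with the underspecified predicate `index` replaced by `predicate`."""
    return [
        (predicate if name == index else name, args)
        for name, args in entry.semantics
    ]


# Entry for the verb in Figure 4, following the predicates given in the text.
BYAGAM = LexicalEntry(
    lexicon_catena=["бягам"],
    semantics=[
        ("run-away-from", ("e", "x0", "x1")),
        ("fact", ("x1",)),
        ("[1]", ("x1",)),
    ],
    frames=[FrameElement("от", [("fact", ("x",)), ("[1]", ("x",))])],
)

if __name__ == "__main__":
    # A realized indirect object contributes a specific predicate via index [1].
    for name, args in fill_shared_index(BYAGAM, "[1]", "truth"):
        print(f"{name}({', '.join(args)})")
```

The stored entry itself stays underspecified; only the semantics computed for a
particular realization carries the specific predicate.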

The lexical entry in Figure 5 is similar to the one shown in Figure 4. The main
differences are as follows. The lexicon-catena represents a MWE and not a
single word. The semantics is the same, because the verb and the MWE are
synonyms. The valency frame contains two alternative elements for the indirect
object, introduced by two different prepositions. The conclusion that the two
descriptions are alternatives follows from the fact that the verb has only a
free indirect object slot. If a direct object slot were free as well, then the
valency set would also contain elements to fill this slot; however, in the MWE
presented, the direct object slot is occupied by a fixed element.

In a nutshell, catenae are an appropriate mechanism for the representation
of MWEs because they adequately encode the grammatical flexibility of some
elements within the MWEs and also allow for the informative representation of
single words.

⁸For a full semantic representation we employ Minimal Recursion Semantics,
introduced by Copestake et al. (2005).

In the rest of the paper we extend the above lexicon model in order to handle 
correspondences among translation pairs with at least one MWE as a member. 


4 Bilingual catena modelling 


In this section we show the treatment of the following bilingual types of pairs in 
Bulgarian and English: MWE-to-MWE and MWE-to-word. Our survey is corpus- 
driven and we have chosen to discuss the most frequent pairs in our data (see 
next section for data statistics). 


4.1 MWE-to-MWE 


Let us consider the example: 


(1) EXAMPLE RD:⁹ взема решение (vzema reshenie) take.PRS.1SG decision
‘reach a decision’.


The two MWEs are flexible in several ways. First, the verb reach (and its
Bulgarian counterpart взема (vzema) ‘take, get’) allows for morphological
variation, including tense, person, etc. Second, the noun decision allows for
pre- and post-modifiers, as in we reached an important decision or they will
reach a decision about us tomorrow. The Bulgarian MWE presents the same
behavior. Figure 6 shows the
lexical entry for the parallel MWEs that are modeled as catenae. In the lexical en- 
tries we can see the catenae for both MWEs. In the next row, the semantics of 
the parallel MWEs is represented with a set of elementary predicates coupled 
with a coindexation strategy between the semantics of the MWE and its frame 
semantics. 

In Figure 6, the indices [1] and [2] represent the unknown semantics of the
modifying nouns. If no modification phrases exist, these predicates are assumed
to express the most general one, namely everything(x). Thus, the set
{take-decision(e,x0,x1), decision(x1,x2), [1](x1), problem(x2), [2](x2)}
represents the meaning of the MWE:¹⁰ an event “take-decision” e with two
participants x0 and x1. The participant x0 is the agent who takes the
decision. The participant x1 is the main argument of the predicate for the
relational noun decision that, being a two-argument predicate, introduces a
third participant in the event, namely the problem that the decision is about,
denoted with the variable x2. If along with the lexicon catena the frame
catena is also realized in the sentence, then the new predicates introduced by
the corresponding nouns are added to the semantics of the new bigger catena.
This mechanism of representing bilingual lexicon entries is suitable for the
processing of the bilingual information, including the shared representation
of the semantics and the correspondences between the grammatical features of
the parallel realisations of the catenae in the different languages.

⁹We use a special notation after each example (RD, IG and CH) to ensure the
correct connection with the corresponding pictures in Figures 6, 7, and 8.

¹⁰The examples present light verb constructions that are translational
equivalents between Bulgarian and English.

In some cases the lexical entry of the parallel MWEs might be quite simple, as 
in the following example: 


(2) EXAMPLE IG: като цяло (kato tsyalo) as whole ‘in general’.


In Figure 7 the adverbials share the same semantics. They do not have frames 
and they allow for no modification. Only the PoS assigned to their elements may 
be different. 


4.2 MWE-to-word 


Concerning the relation MWE-to-word, irrespective of the language direction,
two main cases can be observed. The first one relates to functional PoS, such as
the English preposition after and the Bulgarian complementiser след като (sled
kato) after when, which are translational equivalents and have identical
semantics but differ in PoS and some selectional properties.

A challenging problem occurs when non-functional counterparts are considered,
as in the following example:


(3) EXAMPLE CH: the English term chemicals translates into the Bulgarian
MWE химическ-и продукт-и (himichesk-i produkt-i) chemical-PL
product-PL ‘chemical products’.


Both expressions might be modified by adjectives, PPs or clauses: dangerous
chemicals, chemicals from airplanes, and chemicals that are used by the
pharmaceutical industry. We find similar examples in Bulgarian, like отровни
химически продукти (otrovni himicheski produkti) poisonous.PL chemical.PL
product.PL ‘poisonous chemical products’.

In Figure 8 a part of the parallel lexical entry for this example is presented. It
can be seen that in the English part of the lexical entry there is a catena for a
single word, while in the Bulgarian part there is a catena for a noun phrase of
type adjectival modifier - head noun. The catena for the Bulgarian MWE is
underspecified for the word form and the grammatical features because the whole
phrase might be definite: химическите продукти (himicheskite produkti) ‘the
chemicals’. The English and the Bulgarian entries are specified for the same
semantics. In the frame part of the lexical entries all possible modifications
have to be defined (in the example just one of them is given, namely left
modification with adjectives; however, modification with PPs, etc. has also
been encountered in the data). The important point here is that the lexicon
catenae for the two languages have to contain appropriate correspondences of
the frames in order to be proper translations of each other. The
correspondences of the frames have to be established on semantic grounds: the
corresponding frames in the English and the Bulgarian part have to define the
same semantic contributions to the lexical catenae.

Figure 6: Parallel Lexical Entries for the parallel MWEs: example RD. [Graphic
not reproduced. Recoverable content, identical for both languages:
SM: CNo1:{ take-decision(e,x0,x1), decision(x1,x2), [1](x1), problem(x2),
[2](x2) }; Fr with SM: No1:{ [1](x) }; Fr with SM: No2:{ problem(x), [2](x) }.]

Figure 7: Parallel Lexical Entries for the parallel MWEs: example IG. [Graphic
not reproduced. Recoverable content, identical for both languages:
SM: CNo1:{ generally(e,e1) }; Fr: empty.]

The frame catena in Figure 8 marks the fact that the lexical catena can be
modified by an adjectival modifier. The realization of such a modifier is
additional to the realization of the adjectival modifier химическ-и
(himichesk-i) chemical-PL that is a fixed part of the MWE. In the frame catena
we mark only the nominal head of the MWE.

Note that we do not aim at an exhaustive analysis of all the bilingual pairs. Our
aim is to present a mechanism which would deal with both symmetric
(MWE-to-MWE) and asymmetric (MWE-to-word) relations in translations. Our
hypothesis is that the correspondences between the two languages in the lexicon have
to be governed by the semantics of the lexical catenae and the semantic contri- 
bution of the possible frames. A consequence of this hypothesis is that, in the 
lexicon, we have to allow for correspondences not only between MWEs, but also 
between MWEs and words, and between words/MWEs in one of the languages 
and compositional phrases in the other. 
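
The hypothesis can be made concrete with a small Python sketch in which two
entries count as lexicon correspondences when they share the semantics of the
lexical catena and the semantic contributions of their frames, regardless of
how many words each side contains. The dictionary layout and function names
are assumptions, and the predicate strings are simplified from examples CH and
RD.

```python
def semantic_signature(side):
    """Order-independent summary of SM and of the frame semantics of one side."""
    sm = frozenset(side["sm"])
    frames = frozenset(frozenset(frame) for frame in side["frames"])
    return sm, frames


def corresponds(side_a, side_b):
    """Two entries correspond if their SM and frame contributions coincide."""
    return semantic_signature(side_a) == semantic_signature(side_b)


# Example CH: a single English word against a Bulgarian adjective-noun MWE.
EN_CHEMICALS = {
    "lc": ["chemicals"],
    "sm": ["chemical-product(x0)", "[1](x0)"],
    "frames": [["[1](x)"]],   # adjectival modification
}
BG_CHEMICALS = {
    "lc": ["химически", "продукти"],
    "sm": ["chemical-product(x0)", "[1](x0)"],
    "frames": [["[1](x)"]],
}

# Example RD (English side), used here only as a non-matching entry.
EN_REACH_DECISION = {
    "lc": ["reach", "decision"],
    "sm": ["take-decision(e,x0,x1)", "decision(x1,x2)", "[1](x1)",
           "problem(x2)", "[2](x2)"],
    "frames": [["[1](x)"], ["problem(x)", "[2](x)"]],
}

if __name__ == "__main__":
    print(corresponds(EN_CHEMICALS, BG_CHEMICALS))
    print(corresponds(EN_CHEMICALS, EN_REACH_DECISION))
    print(len(EN_CHEMICALS["lc"]), len(BG_CHEMICALS["lc"]))
```

Because the comparison ignores the size of the lexicon catena, the same check
covers symmetric and asymmetric pairs.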
Figure 8: Parallel Lexical Entries for the parallel MWEs: example CH. [Graphic
not reproduced. Recoverable content: SM: CNo1:{ chemical-product(x0), [1](x) };
Fr: an adjectival modifier (A) of the nominal head (Nc) with
SM: No1:{ [1](x) }; further frames are elided in the figure (SM: ...).]


5 Classification of the parallel data 


In this section we provide a classification of parallel pairs that consist of two
MWEs or an MWE and a word. For each class of correspondences the minimum
information to be included in the lexical entries has been specified. The parallel
Bulgarian-English news media corpus consists of two parts: the SETimes plus
CSLI dataset (920 sentences, or 9308 tokens) and the Penn Treebank dataset
(838 sentences, or 21949 tokens). Thus, our final dataset consists of 1758
sentences, or 31257 tokens. The data was aligned according to Simov et al.
(2011). However, the alignments did not mark the MWEs. For that reason,
additional annotation was performed for detecting the alignments with MWEs in
at least one of the two languages. Our aim was to extract various types of
alignments with at least one MWE as a member. Thus, our data included the
following general types: MWE-to-word, MWE-to-MWE, and MWE-to-compositional
phrase in both language directions.

As shown in Table 1, 510 occurrences of MWEs were detected within these data.
Of these occurrences, 370 are of type MWE-to-word (for example, the English
within is translated as в рамките на (v ramkite na) in frame.PL of); 126 are of
type MWE-to-MWE (for example, the English with respect to is translated as що
се отнася до (shto se otnasya do) as far as relate.PRS.3SG to); and 14 are of
type MWE-to-phrase (for example, the English take-it-or-leave-it is translated
as приемаш или се отказваш (priemash ili se otkazvash) accept.PRS.2SG or
refuse.PRS.2SG).

Table 1: General Classification.

                  Occurrences
MWE-to-MWE        126
MWE-to-word       370
MWE-to-phrase     14
Total             510

Table 2: MWE-to-Word classification.

                  MWE-to-word
Bulgarian MWE     220
English MWE       150
Total             370

Table 2 shows the distribution of MWEs in the largest set, namely the set of 
the type MWE-to-word: 220 Bulgarian and 150 English MWEs were detected. 
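
The figures reported in this section can be cross-checked with a few lines of
Python. The counts are copied from the text and from Tables 1 and 2; the
percentage shares are derived here for convenience and are not reported in the
original study. The variable and function names are ours.

```python
# Counts as reported in the text and in Tables 1 and 2.
CORPUS = {
    "SETimes plus CSLI": {"sentences": 920, "tokens": 9308},
    "Penn Treebank": {"sentences": 838, "tokens": 21949},
}
TABLE_1 = {"MWE-to-MWE": 126, "MWE-to-word": 370, "MWE-to-phrase": 14}
TABLE_2 = {"Bulgarian MWE": 220, "English MWE": 150}


def shares(counts):
    """Percentage share of each class, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    print(sum(part["sentences"] for part in CORPUS.values()))
    print(sum(part["tokens"] for part in CORPUS.values()))
    print(sum(TABLE_1.values()), sum(TABLE_2.values()))
    print(shares(TABLE_1))
    print(shares(TABLE_2))
```

The totals agree with the reported 1758 sentences, 31257 tokens, 510 MWE
occurrences and 370 MWE-to-word pairs; MWE-to-word pairs make up roughly
three quarters of all occurrences.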

Two types of classification are applied. First, the aligned pairs are classified 
into three groups: MWE-to-MWE, MWE-to-word and MWE-to-phrase (see Ta- 
bles 1 and 2). This classification offers a coarse picture of the bilingual situation. 
Then, the classification methods developed in PARSEME WG1 and WG4 are ap- 
plied. These classifications draw on the structural and the semantic features of 
MWEs. 

When mapped to the PARSEME WG1/WG4 typologies, both languages showed
very similar MWE properties. Thus, the most frequent MWE types in both
languages are: verbal MWEs; noun MWEs; other categories of MWEs. The
language-specific features are evident in the subtypes. Thus, phrasal verbs and
reflexive (formally or semantically) се-verbs seem to be the most frequently
used verb MWEs in the English and Bulgarian data respectively. Both languages
feature light verb constructions and VP idioms. Lastly, compounds are the most
frequent type of noun MWEs in English, while adjective-noun phrases are the
most frequent in Bulgarian.

To present a slightly more detailed analysis of the correspondence type
MWE-to-MWE, we use the WG1 classification (predominantly the syntactic and
semantic dimensions), which focuses on the internal structure of the MWEs.
Within the set of the MWE-to-MWE pairs, correspondences are grouped into
straightforward mappings and cross-language specific types. A presentation
of these two groups follows.


5.1 Straightforward mappings 


The class of straightforward mappings includes: verb MWEs (light verb construc- 
tions, VP idioms) and other categories (adverbs, prepositions), etc. 

In this group of translation equivalents, two main classes of Bulgarian-English
MWE pairs are identified: pairs with cross-lingual variance that have to be
considered in the lexicons, and MWEs with no cross-lingual variance that are 
trivially handled in the lexicon. In the first case, the grammatical behavior of the 
MWE elements in both languages has to be taken into account, such as the possi- 
bility of inflection for number, or of accepting modifiers. In the second case, the 
MWE elements hardly undergo inflection or modification, so the translational 
equivalents are registered in the lexicon without further elaboration on the be- 
havior of their elements. 

The first case includes verb and noun MWEs and the second one complex PoS 
and non-inflecting MWEs. 

Examples for the first group are given below: 


• Light verbs in one language often correspond to similar constructions in
  the other. For instance,

  - ‘reach a decision’ взем-а решение (vzem-a reshenie) take-PRS.3SG
    decision, where V NP in English translates to V NP in Bulgarian,

  - ‘take effect’ влез-е в сила (vlez-e v sila) enter-PRS.3SG in power,

  - ‘take control’ влез-е във владение (vlez-e vav vladenie) enter-PRS.3SG
    in possession, where V NP translates to V PP.

  In this group the MWEs are assigned identical semantics, but they might
  differ in the elements and in valence selection.

• Noun MWEs of the type AN that are translational equivalents are often
  literal translations of each other:

  - ‘tough line’ твърда позиция (tvarda pozitsiya) tough position,

  - ‘free market’ свободни-я пазар (svobodni-ya pazar) free-DEF market,

  - ‘real estate’ недвижимо-то имущество (nedvizhimo-to imushtestvo)
    nonmoving-DEF property.

  The MWEs in this group share the same semantics and the same
  modification mechanisms.

• The structure V NP tends to characterise both members in pairs consisting
  of verb MWE translational equivalents:

  - ‘is drawing fire’ привлич-а критик-и-те (privlich-a kritik-i-te)
    attract-PRS.3SG critic.PL-DEF,

  - ‘haven't got a clue’ няма-т представа (nyama-t predstava)
    not.have.PRS.3PL idea.

  The MWEs in this group are assigned the same semantics, but vary in their
  elements and valence selection.

Examples for the second group are given below: 
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e Multiword adverbial constructions: 


- ‘on the other hand’ om dpyea cmpana (ot druga strana) from other side, 
- ‘of course’ pas6up-a ce (razbir-a se) understand-PRs.3sG REFL, 


- “more and more’ ece nogeue u noseue (vse poveche i poveche) even more 
and more, 


- ‘in particular’ e vacmnocm (v chastnost) in detail. 

Here, however, the prepositional complement varies in the PoS across the 
two languages. For example, in the last translational equivalent the English 
prepositional complement is the adjective particular, while in Bulgarian it 
is the noun uacmnocm (chastnost). 


The MWES in this group are assigned the same semantics, but may vary 
in the elements. However, this difference is not taken into consideration, 
because the elements hardly inflect and do not allow for insertion of addi- 
tional elements. 
• Complex prepositions in English tend to have structurally similar
counterparts in Bulgarian. For instance,


- ‘with respect to’ по отношение на (po otnoshenie na) at relation to.


The MWEs in this group are assigned the same semantics, but since,
presumably, they are assigned the same PoS and do not inflect, the element
variance is not relevant.


• Conjunctions composed of multiple words:


- ‘as well as’ както и (kakto i) as and.


Like the complex preposition group, this group also contains MWEs that are 
assigned the same semantics and the same PoS; these MWEs do not inflect, there- 
fore the element variance is not relevant. 


5.2 Cross-language specific types 


Here we include English phrasal verbs that have Bulgarian reflexive се-verbs
as translational equivalents, and English nominal compounds that have other
Bulgarian NP MWEs, mainly adjective-noun or noun-preposition-noun, as
translational equivalents. In this group, translational equivalents are
assigned the same semantics, but they may present systematic structural
differences due to language-specific constructions. The elements in the MWEs
always differ across languages.


• English phrasal verbs often correspond to Bulgarian се-verbs:
- ‘give up’ се откаж-е (se otkazh-e) REFL decline-PRS.3SG,
- ‘move back’ се върна-т (se varna-t) REFL return-PRS.3SG.


Bulgarian and English MWEs in this group may differ in valency and in 
the way meaning is constructed. Thus, Bulgarian uses the lexical aspect 
and the reflexive се (se) to construct MWE meanings, while English uses
the verb in combination with the phrasal affix. 


• English N N compounds can map to A N compounds in Bulgarian:


- ‘face amount’ номинална стойност (nominalna stoynost) nominal value.


The MWEs in this group differ in the PoS of the modifier of the head noun: 
with Bulgarian A N MWEs the head noun is modified by an adjective and 
with English NN MWEs by a noun. 
• English N N can also be translated as N PP in Bulgarian. The first N in the
English MWEs and the PP in the Bulgarian MWEs make the same semantic
contribution:


- ‘law enforcement’ сил-и-те на ред-а (sil-i-te na red-a) force-PL-DEF of
order-SG.DEF.


The MWEs in this group differ in the PoS of the modifier of the head noun:
with Bulgarian N PP MWEs the head noun is modified by a PP and with
English N N MWEs by a noun.


English N and N constructions can apparently be translated with coordi- 
nated constructions in Bulgarian; however, the PoS of the coordinated con- 
stituents differs across the two languages: 


- ‘pros and cons’ доводи за и против (dovodi za i protiv) argument.PL for
and against
(N and N / N p and p).


The MWEs in this group differ in the head obligatoriness. In Bulgarian the
head noun is present, while in English a head noun is only inferred. 


An English idiomatic clausal construction (V NP PP) can be translated with 
a light verb construction in Bulgarian: 


- ‘putting pen to paper’ предприел действие (predpriel deystvie)
take.PTSP.3SG action.


The MWEs in this group differ with respect to modification and selectional
properties. The English MWE does not seem to admit any modifiers, while
its Bulgarian translational equivalent allows for them (for example,
предприел важно действие (predpriel vazhno deystvie) taken.PTSP.3SG
important action).


English V AP can be translated in Bulgarian with minimal changes into V 
AdvP: 


- ‘broke even’ са излезли начисто (sa izlezli nachisto) are come.out.PRST.3SG
clean.


The English adjective even translates into the Bulgarian adverb начисто
(nachisto) ‘clean’.


* English V PP can be translated as V NP in Bulgarian: 
- ‘will be priced of a job’ ще загубя-т работа-та си (shte zagubya-t
rabota-ta si) will lose-PRS.3SG job DEF.


It is interesting to observe that an English passive construction can be 
translated with a Bulgarian active construction. In such cases the valency 
parts will differ with respect to both the predicate and the participants. 


Our work on the Bulgarian-English lexicon aims to provide representations for 
all these types of correspondence: the representations will be bilingual catena- 
based lexical entries. 


6 Conclusions 


The paper has argued that the catena approach can be extended to model pairs of 
translational equivalents retrieved from parallel English-Bulgarian corpora with 
at least one MWE as a member. In this way, cross-language asymmetries are 
handled. Our frequency counts have shown that the MWE-to-MWE and MWE- 
to-word correspondences are prevalent. In contrast, the MWE-to-phrase corre- 
spondence was not found to have a wide distribution. It would be interesting 
to perform a detailed analysis of more examples in order to uncover persistent 
correspondences between the two languages. Such knowledge can be used in 
designing automatic translation systems and in identifying best practices in hu- 
man translation. Furthermore, these correspondences can possibly illuminate the 
different ways employed by the two languages to express meaning. 

The proposed catena model takes into consideration both flexibility and id- 
iomaticity when representing MWEs and words in the lexicon. These dimensions 
can be detailed further depending on the available specific subclassifications in 
a cross-lingual aspect. 
def     definite noun        poss    possessive pronoun
LC      lexicon catena       SM      semantics
PoS     part of speech       FI      valency frames
plur    plural number
In this article, we analyze Spanish multiword expressions (MWEs) and describe
their linguistic properties. The ultimate goal of our analysis is to find an MWE
taxonomy for Spanish which is suitable for Natural Language Processing purposes.
As a starting point of our study, we take the MWE taxonomy proposed by Ramisch
(2012; 2015). This taxonomy distinguishes between morphosyntactic classes and
other classes which cannot be considered morphosyntactic, which he calls
“difficulty classes”. To carry out our research, a data set of Spanish MWEs was
built and subsequently analyzed. We also added a new axis to Ramisch's (2012;
2015) taxonomy, namely the flexibility axis introduced by Sag et al. (2002). In
the light of our analysis, we modified and adapted the taxonomy to Spanish MWEs.
The different types of MWEs in Spanish are analyzed and described in this
article. Flexibility tests for Spanish MWEs are also discussed.


1 Introduction 


Research on multiword expressions (MWEs) has a long history both in linguistics 
and in Natural Language Processing (NLP). Many researchers have addressed the 
MWE challenge from different perspectives (Mel'čuk & Polguère 1987; Church &
Hanks 1990; Sinclair 1991; Smadja 1993; Moon 1998; Lin 1999). 
MWEs are part of the lexicon of native speakers of a language and thus are
interesting from a theoretical linguistics point of view. Researchers working on 
language acquisition also assess the acquisition of MWEs (Devereux & Costello 
2007; Villavicencio et al. 2012; Nematzadeh et al. 2013); and they have also been 
researched in psycholinguistics (Rapp 2008; Holsinger & Kaiser 2013; Holsinger 
2013; Schulte im Walde & Borgwaldt 2015), among other theoretical fields. In the 
case of NLP applications, MWEs need to be correctly detected and processed. In 
addition, when NLP applications deal with two or more languages, the treatment 
of MWEs needs to deal with multilingual aspects. 

A lot of research has focused on specific subclasses of MWEs (e.g. idioms, col- 
locations, light verb constructions). More general works studying the MWE phe- 
nomenon as such have focused on English, or have taken prior research on En- 
glish as a starting point. However, this English-driven analysis needs to be fur- 
ther investigated taking other languages into account. As the intrinsic character- 
istics of a language vary, it seems necessary to use broad, general taxonomies 
that allow for the classification, description and analysis of MWEs notwithstand- 
ing the language they are applied to. In this article, we test this by analyzing 
Spanish MWEs using an existing taxonomy. 

As a starting point of our study, we take the MWE taxonomy proposed by 
Ramisch (2012; 2015). He distinguishes three morphosyntactic classes and three 
additional so-called "difficulty classes". The three morphosyntactic classes are 
nominal expressions, verbal expressions and adverbial and adjectival expressions. 
Nominal expressions are further subdivided in noun compounds, proper names 
and multiword terms, and verbal expressions in phrasal verbs and light verb con- 
structions. Finally, he distinguishes three difficulty classes: fixed expressions, id- 
iomatic expressions, and "true" collocations. 

We created a data set of Spanish MWEs with the aim of finding examples of 
each type of MWE proposed by Ramisch (2012; 2015). Then, we reviewed our data 
set and the features of the different MWEs gathered. As a result of this study, 
we revised the taxonomy and modified it to make it conform with the Spanish 
language. 

The remainder of this article is structured as follows: §2 summarizes existing
MWE taxonomies and §3 discusses MWE fixedness tests applicable to Spanish
and used in our study. §4 explains the creation of our initial data set of Spanish
MWEs. In §5, we present the taxonomy we propose for Spanish MWEs based on
the results of our research. We also update the information about our data set,
expanded to cover all types of MWEs in our new taxonomy. §6 is devoted to the
description of the linguistic properties of each MWE type for Spanish. Finally, §7
summarizes our work.
2 Multiword expression typologies


There seems to be a lack of a commonly used taxonomy of MWEs, both in
theoretical linguistics and in NLP. In fact, several MWE taxonomies have been proposed
throughout the years. Most of them have focused on English MWEs, but as we 
will point out later in this section, there also exist other taxonomies based on dif- 
ferent languages. While it is not the purpose of this section to discuss all existing 
MWE taxonomies and assess their applicability to the Spanish language and NLP, 
we think that a brief overview of the state-of-the-art as regards the classification 
of MWEs is needed. This will not only illustrate the task at hand - finding an 
MWE taxonomy suitable for Spanish from an NLP point of view — but it will also 
illustrate the great existing variety of approaches and perspectives. 


2.1 MWE taxonomies in theoretical linguistics


As mentioned earlier, several researchers have worked on the analysis and classi- 
fication of MWEs from a theoretical linguistics point of view. Some of them, such 
as Moon (1998), worked on specific types of MWEs, while others like Mel'čuk &
Polguère (1995) and Fillmore et al. (1988) addressed more general issues. As
mentioned by Moon (1998), there is a lack of agreement as far as the terminology on
the topic is concerned and she reported the extended discussions of the problem 
as proof of it. We will not discuss her work here, as her taxonomy - despite be- 
ing a reference - only focuses on English fixed expressions and idioms and leaves 
out other important MWE classes such as compound words because they were 
beyond the scope of her study. 

Fillmore et al. (1988) proposed a typology based on the predictability of a con- 
struction with respect to the syntactic rules. They distinguished three classes: un- 
familiar pieces unfamiliarly combined, familiar pieces unfamiliarly combined, and 
familiar pieces familiarly combined. While familiar pieces familiarly combined are 
formed following the rules of grammar, they have an idiomatic interpretation. Fa- 
miliar pieces unfamiliarly combined require special syntactic and semantic rules, 
and unfamiliar pieces unfamiliarly combined are unpredictable. 

Mel'čuk & Polguère (1995), on the other hand, used as their criterion the
relevance of an expression as a dictionary entry. Their taxonomy is thus mainly based
on the semantics of MWEs, and they distinguished between complete 
phrasemes, semi-phrasemes and quasi-phrasemes. In their approach, complete phra- 
semes are fully non-compositional and would constitute an independent dictio- 
nary entry. Semi-phrasemes would be those in which at least one of the elements 
preserves its meaning, and could be listed in the dictionary entry of the base 
word of the phraseme. Finally, quasi-phrasemes are expressions in which all
elements keep their original meaning but their combination adds an extra element
of meaning, constituting independent dictionary entries. 


2.2 MWE taxonomies in Natural Language Processing 


MWEs are not only a topic of interest in theoretical linguistics. In NLP research
they constitute a major bottleneck for various applications and tools and thus 
have also been extensively investigated. Sag et al. (2002) and Baldwin & Kim 
(2010) proposed MWE taxonomies from the point of view of NLP. 

Sag et al. (2002) discuss strategies for processing MWEs in NLP applications 
and thus proposed a taxonomy mainly based on their syntactic fixedness, as this 
is what needs to be modeled to deal with MWEs in a successful way. Figure 1 
summarizes their taxonomy. They first distinguish between lexicalized and insti- 
tutionalized phrases and then they further divide lexicalized phrases into fixed 
(e.g. by and large), semi-fixed and syntactically flexible. Semi-fixed MWEs include 
non-decomposable idioms (e.g. to spill the beans; to kick the bucket), compound 
nominals (e.g. attorney general; car park), and proper names (e.g. San Francisco; 
Oakland Raiders). Syntactically-flexible MWEs, on the other hand, include verb- 
particle constructions (e.g. to look up; to break up), decomposable idioms (e.g. to 
let the cat out of the bag; to sweep under the rug), and light verbs (e.g. to make 
a mistake; to give a lecture). According to Sag et al. (2002), lexicalized phrases 
are explicitly encoded in the lexicon, whereas institutionalized phrases are only 
statistically idiomatic.¹


Multiword expressions
  Lexicalized
    Fixed
    Semi-fixed
      Non-decomposable idioms
      Compound nominals
      Proper names
    Syntactically flexible
      Verb-particle constructions
      Decomposable idioms
      Light verbs
  Institutionalized


Figure 1: Taxonomy proposed by Sag et al. (2002). 


¹All examples are taken from Sag et al. (2002).
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Baldwin & Kim (2010) carry out a twofold classification. They make a mor- 
phosyntactic classification and, additionally, they propose an MWE classification 
based on syntactic variability, which in turn is based on that of Sag et al. (2002). 
In their taxonomy, illustrated in Figure 2, they group compound nominals and 
proper names into a broader category named nominal MWEs. From a morphosyn- 
tactic point of view, they distinguish nominal, verbal and prepositional MWEs. 
Verbal MWEs are further classified into verb-particle constructions, prepositional 
verbs, light-verb constructions and verb-noun idiomatic combinations, and prepo- 
sitional MWEs are classified into determinerless-prepositional phrases (PP-Ds, e.g. 
on top) and complex prepositions (complex PPs, e.g. in addition to). 


Multiword expressions
  Lexicalized
    Fixed
      Non-modifiable PP-Ds
      Complex PPs
    Semi-fixed
      Non-decomposable idioms
      Nominal MWEs
      PP-Ds with strict constraints
      Complex PPs
    Syntactically flexible
      Verb-particle constructions
      Decomposable idioms
      Light verbs
      Highly productive PP-Ds
  Institutionalized


Figure 2: Taxonomy proposed by Baldwin & Kim (2010). 


Ramisch (2012; 2015) proposed a simplified typology based on the morphosyn- 
tactic role of the whole MWE in a sentence and its difficulty from an NLP per- 
spective. As illustrated in Figure 3, he identifies three morphosyntactic classes 
(nominal expressions, verbal expressions, and adverbial and adjectival expressions) 
and three additional so-called difficulty classes (fixed expressions, idiomatic expres- 
sions, and "true" collocations). Nominal expressions are further subdivided into 
noun compounds (e.g. traffic light; Russian roulette), proper names (e.g. United
Nations; Alan Turing) and multiword terms (e.g. profit and loss account; myocardial
infarction). Verbal expressions are further subdivided into phrasal verbs, which
in turn are subdivided into transitive prepositional verbs (e.g. to agree with; to rely
on) and more opaque verb-particle constructions (e.g. to give up; to take off); and
light verb constructions (e.g. to take a walk; to give a talk).


²For Baldwin & Kim (2010), verb-particle constructions are “a verb and an obligatory particle,
typically in the form of an intransitive preposition (e.g. play around, take off), but including
adjectives (e.g. cut short, band together) and verbs (e.g. let go, let fly)”. Prepositional verbs are
“a verb and a selected preposition, with the crucial difference that the preposition is transitive
(e.g. refer to, look for)”. Although they do not discuss it further, there are cases such as look
forward to, which would fall into both categories.


Multiword expressions
  Morphosyntactic classes
    Nominal expressions
      Nominal compounds
      Proper names
      Multiword terms
    Verbal expressions
      Phrasal verbs
        Transitive prepositional verbs
        Opaque verb-particle constructions
      Light verb constructions
    Adverbial and adjectival expressions
  Difficulty classes
    Fixed expressions
    Idioms
    “True” collocations


Figure 3: Simplified taxonomy proposed by Ramisch (2012; 2015). 


2.3 Spanish MWE taxonomies 


Although Spanish is a widely researched language, few researchers have worked 
on taxonomies of Spanish MWEs. The main reference for our study could be 
the seminal work in phraseology by Corpas Pastor (1996), who studied Spanish
phraseological units, revised previous work and proposed a new taxonomy
to classify them. Her taxonomy attempted to establish a classification of Spanish
phraseological units based on a set of criteria that should help classify any
unit under a specific type. Her taxonomy, summarized in Figure 4, has three major
categories, subsequently subdivided into more fine-grained subclasses. While
collocations are classified following their possible part-of-speech patterns (e.g. 
subject noun-verb, adjective+noun, etc.), expressions are classified according to 
the syntactic role they may have in a sentence (e.g. nominal expressions, verbal 
expressions, prepositional expressions...). Finally, phraseological expressions are di- 
vided into sentences with a specific value, quotes and proverbs. 


[Tree diagram. Root: Phraseological Units, branching into Phrase and Sentence.
Second level: Grammatically fixed, Fixed by usage, Fixed by the system.
Leaves: Collocations, Expressions, Phraseological Expressions.]


Figure 4: Taxonomy of Spanish phraseological units by Corpas Pastor 
(1996). 


From an NLP point of view, the work by Corpas Pastor (1996) cannot be eas- 
ily adapted for NLP usage because many classes could be difficult to distinguish 
from one another. Nominal expressions, for instance, are further subdivided into 
types following a determined part-of-speech pattern. However, some of these 
patterns are identical to the ones used to classify collocations. Thus,
automatically determining whether a “noun+adjective” sequence should be classified
as a collocation (e.g. enemigo acérrimo ‘archenemy’) or a nominal expression (e.g.
mosquita muerta ‘two-faced person’) could be challenging.

Finally, it is also worth mentioning the work by Leoni de León (2014), who 
also attempted to propose a typology of phraseological units based on the lex- 
ical status and the syntactic phenomena of MWEs. In his taxonomy, he first 
distinguishes between multi-member lexical units, which are “units of meaning
without necessarily being lexical units”, and collocations, which are “a lexical
choice probably motivated by communication style, with no semantic
implications”. Multi-member lexical units are further divided into lexicalized units
(multi-member lexemes) and non-lexicalized ones. According to Leoni de León (2014),
multi-member lexemes can be characterized by the procedures used to create 
them. Thus, he distinguishes between those undergoing morphological proce- 
dures (poly-lexemic lexemes), and those undergoing syntactic procedures (com- 
bined lexemes). Non-lexicalized units can either be phrasemes or thematic fusions. 
He defines thematic fusions as “the result of the combination of a supporting
verb and a predicative nominal”, and phrasemes as “unit(s) of meaning formed
from at least two open-class lexical morphemes, one of which constitutes the
nucleus of the unit and bears the category V”. As far as phrasemes are concerned,
he distinguishes between “continuous expressions that extend across a sentence”
(complete phrasemes), and “discontinuous expressions that can be replaced by a
verb” (syntagmatic phrasemes). Figure 5 illustrates his taxonomy.


Poly-lexicality
  Multi-member lexical units
    Multi-member lexemes
      Poly-lexemic
      Combined
    Thematic fusions
    Phrasemes
      Complete
      Syntagmatic
  Collocations


Figure 5: Taxonomy of Spanish phraseological units by Leoni de León 
(2014). 


In this article, we use the taxonomy proposed by Ramisch (2012; 2015) as a start- 
ing point for a taxonomy of Spanish MWEs and we combine it with the approach 
taken by Sag et al. (2002) and Baldwin & Kim (2010) based on syntactic flexibility. 
This decision was made because these taxonomies are widely used in
the research community and we wanted to test whether an English-driven
taxonomy could be applied to the Spanish language.


3 MWE fixedness tests for Spanish 


As one of our objectives was to classify MWEs according to their degree of syn- 
tactic flexibility, it is important to determine how this flexibility is going to be 
measured. Here, we will consider fixed expressions those which admit no alter- 
ation of their form. Semi-fixed expressions will be those which have a certain 
degree of morphosyntactic variability. This variability, however, is due to the
need to conform with the grammatical and orthographical rules of the Spanish 
language and thus is controlled to a certain extent. From an NLP point of view, 
these expressions could be easily processed. In the case of fixed MWEs, the words- 
with-spaces approach proposed by Sag et al. (2002) could be used, while in the 
case of semi-fixed MWEs, this approach could be used adding pointers to the in- 
flected parts of the MWE, just as Sag et al. (2002) also propose. Finally, flexible 
MWEs will be those presenting a high degree of variability in their usage (e.g.
non-contiguousness, free slots, etc.), which makes their form difficult to predict. 
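
The difference between the two easily processed classes can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the lexicon format, the slot notation and the function names are hypothetical and are not taken from Sag et al. (2002). Fixed MWEs are stored as plain strings (words with spaces), while a semi-fixed entry points to the token that inflects and lists the forms it may take.

```python
import re

# Fixed MWEs: stored as plain strings ("words with spaces").
FIXED_MWES = {"by and large"}

# Semi-fixed MWEs: a token template plus, for each inflecting slot,
# the surface forms that may fill it (hypothetical entry format).
SEMI_FIXED_MWES = {
    "anillo de compromiso": {
        "template": ["{head}", "de", "compromiso"],
        "slots": {"head": ["anillo", "anillos"]},
    },
}


def expand(entry):
    """Return every surface string licensed by a semi-fixed entry."""
    forms = [""]
    for token in entry["template"]:
        if token.startswith("{") and token.endswith("}"):
            options = entry["slots"][token[1:-1]]  # inflecting slot
        else:
            options = [token]  # invariable token
        forms = [f"{prefix} {opt}".strip() for prefix in forms for opt in options]
    return forms


def find_mwes(text):
    """Return (surface form, lexicon key) pairs found in the text."""
    lowered = text.lower()
    hits = []
    for mwe in FIXED_MWES:
        if re.search(rf"\b{re.escape(mwe)}\b", lowered):
            hits.append((mwe, mwe))
    for key, entry in SEMI_FIXED_MWES.items():
        for form in expand(entry):
            if re.search(rf"\b{re.escape(form)}\b", lowered):
                hits.append((form, key))
    return hits


print(find_mwes("Compraron dos anillos de compromiso."))
```

Flexible MWEs cannot be handled this way, since the material that may intervene between their elements cannot be enumerated in advance.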

Based on previous work by Nunberg et al. (1994), where they try to determine 
the fixedness of MWEs, we designed a set of potential tests to establish the degree 
of flexibility of Spanish MWEs. This list may be expanded upon further research 
and, as pointed out by Laporte (2018 [this volume]), it needs further testing to be
supported with statistics. However, we believe that it is a valid starting point for 
any work on the flexibility of Spanish MWEs and their further linguistic descrip- 
tion. 


3.1 Inflection 


Spanish is a morphologically rich language. Thus, the first test that can be used to
determine whether an MWE has some degree of flexibility is to check its inflec- 
tion. In the case of nouns and adjectives, whether or not these can be inflected 
for number, and in some cases for gender, shall be checked. Generally, adjectives 
agree in number and gender with the nouns they complement. Thus, their inflec- 
tion will be dependent on the possibility to inflect their head noun. Examples 
(1a)-(1b), (2a)-(2b) and (3a)-(3d) exemplify this.


(1) a. anillo    de   compromiso        b. anillos   de   compromiso
       N.MASC.SG PREP N.MASC.SG            N.MASC.PL PREP N.MASC.SG
       ring      of   engagement           rings     of   engagement
       ‘engagement ring’                   ‘engagement rings’

(2) a. raíz     cuadrada                b. raíces   cuadradas
       N.FEM.SG ADJ.FEM.SG                 N.FEM.PL ADJ.FEM.PL
       root     square                     roots    square
       ‘square root’                       ‘square roots’
(3) a. lobo         con  piel     de   cordero
       N.MASC.SG    PREP N.FEM.SG PREP N.MASC.SG
       wolf.MASC.SG with skin     of   lamb
       ‘wolf.MASC.SG in sheep's clothing’

    b. loba         con  piel     de   cordero
       N.FEM.SG     PREP N.FEM.SG PREP N.MASC.SG
       wolf.FEM.SG  with skin     of   lamb
       ‘wolf.FEM.SG in sheep's clothing’

    c. lobos          con  piel     de   cordero
       N.MASC.PL      PREP N.FEM.SG PREP N.MASC.SG
       wolves.MASC.PL with skin     of   lamb
       ‘wolves.MASC.PL in sheep's clothing’

    d. lobas          con  piel     de   cordero
       N.FEM.PL       PREP N.FEM.SG PREP N.MASC.SG
       wolves.FEM.PL  with skin     of   lamb
       ‘wolves.FEM.PL in sheep's clothing’


When the MWE includes a pronominal reference to a person, this can also have 
some variance to agree with the reference. Additionally, when the MWE includes 
a verb, this can also be inflected for person, tense and mood. Examples (4a)-(4d)
and (5a)-(5c), respectively, exemplify this. 


(4) a. el          que  corta             el          bacalao
       DET.MASC.SG PRON V.3RD.SG.PRES.IND DET.MASC.SG N.MASC.SG
       the         who  cuts              the         cod
       ‘big fish.MASC.SG’

    b. la          que  corta             el          bacalao
       DET.FEM.SG  PRON V.3RD.SG.PRES.IND DET.MASC.SG N.MASC.SG
       the         who  cuts              the         cod
       ‘big fish.FEM.SG’

    c. los         que  cortan            el          bacalao
       DET.MASC.PL PRON V.3RD.PL.PRES.IND DET.MASC.SG N.MASC.SG
       the         who  cut               the         cod
       ‘big fishes.MASC.PL’

    d. las         que  cortan            el          bacalao
       DET.FEM.PL  PRON V.3RD.PL.PRES.IND DET.MASC.SG N.MASC.SG
       the         who  cut               the         cod
       ‘big fishes.FEM.PL’

(5) a. Vives             a    cuerpo    de   rey.
       V.2ND.SG.PRES.IND PREP N.MASC.SG PREP N.MASC.SG
       live.you          by   body      of   king
       ‘You live high on the hog.’

    b. Vivieron          a    cuerpo    de   rey.
       V.3RD.PL.PAST.IND PREP N.MASC.SG PREP N.MASC.SG
       lived.they        by   body      of   king
       ‘They lived high on the hog.’

    c. Hubiera vivido            a    cuerpo    de   rey.
       V.1ST/3RD.SG.PAST.SUBJ    PREP N.MASC.SG PREP N.MASC.SG
       would have lived.I/he/she by   body      of   king
       ‘I/he/she would have lived high on the hog.’


As the variation of this type of MWEs is controlled, in our study all MWEs 
which only undergo inflection are classified as semi-flexible MWEs. 


3.2 Change of determiner 


In some cases, the determiner appearing in an MWE is flexible in the sense that 
there are several items that can occupy that spot within the MWE. Examples 
(6a)-(6c) illustrate some of the variations of two of the MWEs in our data set. 


(6) a. Nos       hicimos         varias     fotos.
       PRON.1.PL V.1.PL.PAST.IND ADJ.FEM.PL N.FEM.PL
       Ourselves took.1ST.PL     several    pictures
       ‘We took several pictures.’

    b. Nos       hicimos         muchas     fotos.
       PRON.1.PL V.1.PL.PAST.IND ADJ.FEM.PL N.FEM.PL
       Ourselves took.1ST.PL     many       pictures
       ‘We took many pictures.’

    c. Nos       hicimos         una        foto.
       PRON.1.PL V.1.PL.PAST.IND ADJ.FEM.SG N.FEM.SG
       Ourselves took.1ST.PL     a          picture
       ‘We took a picture.’
In our study, if an MWE only undergoes a change of determiner, it is classified
as a semi-flexible MWE because this feature can be modeled computationally. 


3.3 Pronominalization


Another useful test to check the degree of flexibility of an MWE is to test whether 
part of it can be pronominalized. This is only possible for the Noun Phrase and 
Complementizer Phrase parts of verbal MWEs. Examples (7) and (8) illustrate 
such cases.³

(7) Habíamos quedado para hacer las fotos el lunes pero al
    Agreed.to.meet.1ST.PL to make the pictures the Monday, but in.the
    final las hicimos el martes.
    end them made.1ST.PL the Tuesday

    ‘We had agreed to take the pictures on Monday, but in the end we took
    them on Tuesday.’

(8) Después de cenar dimos un largo paseo por el campo y
    After of dinner went.1ST.PL a long walk through the field and
    lo disfrutamos mucho.
    it enjoyed.1ST.PL a lot

    ‘We went for a long walk through the field after dinner and we
    enjoyed it greatly.’


When part of a Spanish MWE can be pronominalized, we classify such an MWE
as a flexible MWE because the fact that not all lexical elements are together in
the same clause makes its identification and processing more difficult. While in
example (7) the object of the MWE (las fotos ‘the pictures’) is pronominalized and
the same verb is used in the second occurrence of the MWE, in example (8) the
object is used as the object of a different verb (disfrutar ‘to enjoy’).


3.4 Topicalization 


In some cases, it is possible to alter the order in which the elements of an MWE 
appear. As with the pronominalization of MWEs, topicalization is only possible
for the Noun Phrase and Complementizer Phrase parts of verbal MWEs. Example (9)
shows how the prepositional phrase (de política ‘about politics’) of a verb with
a governed prepositional phrase (hablar de ‘talk about’) may be fronted and
appear before the verb itself. Example (10) illustrates how in interrogative
sentences the noun phrase of a light verb construction (qué trato ‘what deal’)
may also be placed prior to the verb it refers to (harán ‘make’).⁴


³From here on, we omit the morphological analysis of the examples as it is not needed to
illustrate the flexibility issues described.


(9) De política no hablaban nada más que los domingos.
    About politics not talked.3RD nothing more than the Sundays

    ‘They only talked about politics on Sundays.’

(10) ¿Qué trato crees que harán las empresas?
     What deal think.2ND that will.make.3RD the companies

     ‘What deal do you think the companies will make?’


When an MWE allows for the topicalization of part of it, we classify it as a 
flexible MWE. An additional reason is that when topicalization occurs, the MWE 
appears separated in the clause. As it is not possible to determine how many 
other phrases (and of which type) can appear between the elements of the MWE, 
its successful processing requires more than just a morphosyntactic analysis. 
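
To make the difficulty concrete, the following Python sketch looks for the elements of an MWE anywhere in a sentence, in any order and with any number of intervening tokens. It is a simplified illustration under our own assumptions: the input is taken to be already lemmatized (the lemmas below were assigned by hand), and the function name is hypothetical. Lemma co-occurrence alone does not prove that the MWE reading is intended, which is precisely why a deeper analysis is needed.

```python
def find_discontinuous(tokens, mwe_lemmas):
    """Locate each lemma of an MWE in a sentence, ignoring order and gaps.

    tokens: list of (surface form, lemma) pairs.
    mwe_lemmas: set of lemmas that make up the MWE.
    Returns the sorted token positions (first occurrence of each lemma),
    or None if some lemma of the MWE is missing.
    """
    positions = {}
    for index, (_form, lemma) in enumerate(tokens):
        if lemma in mwe_lemmas and lemma not in positions:
            positions[lemma] = index
    if set(positions) != set(mwe_lemmas):
        return None
    return sorted(positions.values())


def is_contiguous(positions):
    """True if the located elements are adjacent to one another."""
    return positions[-1] - positions[0] == len(positions) - 1


# Example (10), lemmatized by hand for illustration.
sentence = [
    ("¿", "¿"), ("Qué", "qué"), ("trato", "trato"), ("crees", "creer"),
    ("que", "que"), ("harán", "hacer"), ("las", "el"),
    ("empresas", "empresa"), ("?", "?"),
]

hits = find_discontinuous(sentence, {"hacer", "trato"})
print(hits, is_contiguous(hits))
```

In example (10) the noun precedes the verb and two other tokens separate them, so a words-with-spaces entry for hacer un trato would not match.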


3.5 Subordinate clauses 


MWEs can also appear in complex sentences which have subordinate clauses. In
this case, two phenomena may occur. First, the MWE can be partially embedded
in a subordinate clause because the element appearing outside of the
subordinate clause is also the antecedent of the subordinating conjunction. Example (11)
shows this: el trato ‘the deal’ is the antecedent of the subordinating conjunction
que ‘that/which’.


(11) El trato que hizo mi hermana consistía en ...
     The deal that made my sister consisted in ...

     ‘The deal my sister made involved ...’


Second, part of the MWE can be the antecedent of a subordinate clause, as
illustrated in (12).


(12) Mi hermana hizo un trato que consistía en ...
     My sister made a deal that consisted in ...

     ‘My sister made a deal that involved ...’


⁴In this example, a second phenomenon occurs, as the verb is part of a subordinate clause
whereas the noun phrase is part of the main clause. This is discussed in the next flexibility test
in §3.5.
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When a part of an MWE can be embedded in a relative clause or be the an- 
tecedent of a relative clause, we classify it as a flexible MWE. 


3.6 Passivization 


A frequent way of testing the flexibility of English MWEs is to test whether or not 
their passivization is possible. As the passive voice is not as frequent in Spanish 
as in English, this test may not be very informative for testing Spanish MWEs. 
Moreover, in Spanish there are two passivization mechanisms: 


1. Passives using the auxiliary verb ser 'to be'; and 


2. passives using the pronoun se, also called ‘passive se’. 


Passives using the auxiliary verb ser are not very frequent, and it is more
common to find ‘passive se’ sentences.

In the case of MWEs, this test can still be used, and in some cases, such as the
one in example (13), it will be possible to find an MWE appearing in a passive
voice construction. In some cases, both types of passives are possible. Example
(14) shows how the passivization of example (13) could also be done by means
of the Spanish pronoun se.


(13) La decisión fue tomada el lunes.
The decision was taken the Monday


‘The decision was made on Monday.’


(14) La decisión se tomará el lunes.
The decision itself will.be.taken.3RD.SG the Monday


‘The decision will be made on Monday.’


If an MWE can only undergo passivization (i.e. all other tests are negative), we
classified it as semi-flexible. Otherwise, we classified it as a flexible MWE.


3.7 Appearance of other elements 


In some cases, other elements such as adjectives, adverbs or pronouns which do
not belong to the MWE appear embedded in the MWE. The number of elements
that can appear embedded in the MWE also varies. There could be only one
element, or several. Examples (15) to (17) illustrate this.


(15) dar un largo paseo
to take a long walk


‘to take a long walk’


(16) dar un largo y agradable paseo
to take a long and nice walk


‘to take a long and nice walk’


(17) echar profundamente la siesta
to take deeply the nap


‘to take a nap deeply’


When other elements can appear embedded within the elements of an MWE 
we classified it as a flexible MWE. 


3.8 Ellipsis 


Finally, part of an MWE can sometimes be omitted. This is usually the case when,
for instance, the object of an MWE has been mentioned earlier and then it is
referred to at a later stage. Example (18) illustrates this. In the example, the
complement of the verb hacer ‘to do’ is elided but qué ‘what’ is used to refer to it
(‘what deal’).


(18) ¿Qué crees que harán?
What think.2ND.SG that do.3RD.PL


‘What (deal) do you think they will do?’


Ellipsis may also occur when there is coordination. Example (19) illustrates
this by showing two coordinated main clauses that share the same predicate
(quedarse ‘to keep for oneself’) with a change both of the subject (María/Juan),
and of the complement of the prepositional phrase governed by the verb (el libro
‘the book’ vs. el disco ‘the disc’).


(19) María se quedó con el libro y Juan con el disco. 
María herself kept with the book and Juan with the disc 


‘María kept the book and Juan the disc.’


In those cases in which an MWE allows for the omission of part of it, we clas- 
sified the MWE as a flexible MWE. 
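
The decision rules stated for the tests in §§3.4–3.8 can be summarized in a few lines of code. The following is only a minimal sketch under stated assumptions: the class and function names are hypothetical, only the five tests described above are covered, and the "undetermined" outcome (returned when none of these tests is positive) is an assumption of the sketch, since such cases are decided by the remaining tests.

```python
from dataclasses import dataclass


@dataclass
class FlexibilityOutcomes:
    """Outcome of each test from this part of Section 3 (True = the MWE passes)."""
    topicalization: bool = False
    subordination: bool = False
    passivization: bool = False
    insertion: bool = False   # other elements can appear embedded in the MWE
    ellipsis: bool = False


def classify_flexibility(outcomes: FlexibilityOutcomes) -> str:
    """Apply the rules stated in the text to the five tests above."""
    non_passive_tests = [
        outcomes.topicalization,
        outcomes.subordination,
        outcomes.insertion,
        outcomes.ellipsis,
    ]
    # Any positive test other than passivization marks the MWE as flexible.
    if any(non_passive_tests):
        return "flexible"
    # Passivization alone (all other tests negative) yields semi-flexible.
    if outcomes.passivization:
        return "semi-flexible"
    # Assumption of this sketch: these five tests do not settle the case.
    return "undetermined"
```

For instance, an expression that passes only the passivization test is labelled semi-flexible, whereas one that also admits an embedded adjective is labelled flexible.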


4 Creating a data set to analyze Spanish MWEs


As a starting point for our study, we took the MWE taxonomy proposed by Ra- 
misch (2012; 2015) and created a preliminary data set of Spanish MWEs. It was 
not compiled by doing a corpus analysis and subsequently trying to analyze and 
classify the MWEs detected, but rather by taking the English examples from Ra- 
misch (2012; 2015) and trying to find similar ones in Spanish. The preliminary 
data set consisted of 150 Spanish MWEs classified according to Ramisch's taxon- 
omy (Parra Escartín et al. 2015). 

Figure 6 exemplifies all of the MWE types distinguished in Ramisch's taxonomy
with Spanish examples and their translations into English. As may also be
observed, there is no example for phrasal verbs. This is because Spanish lacks
such a type of MWE, although there are verbs with a governed prepositional
phrase (e.g. acordarse de ‘to remember’) which, to a certain extent, have a similar
behavior to that of English phrasal verbs.⁵

We then analyzed and classified the MWEs by their degree of difficulty for 
NLP purposes. To this aim, we used the "fixed, semi-fixed, flexible" classification 
proposed in the papers by Sag et al. (2002) and Baldwin & Kim (2010). 

The Spanish Grammar⁶ (Real Academia Española 2010) was also used to detect
additional MWE types not present in the taxonomy, describe MWE subclasses,
and gather further examples for our data set. As we aimed at having a number
of entries for each MWE type that allowed us to properly describe its features,
additional new entries were also added to the data set. Appendix A, Appendix B,
and Appendix C comprise our data set classified in fixed, semi-fixed and flexible
MWEs respectively.


5 Our Spanish MWE taxonomy 


When creating our data set, we realized that the taxonomy we had started to work
with did not completely match the Spanish MWEs we were gathering. Thus,
we started to modify the taxonomy and adapt it to the Spanish language. This
confirms the common criticism against current MWE taxonomies claiming they
are based on the English language and that other languages cannot be classified
in the same way.


⁵As pointed out in the annotation guidelines for the PARSEME shared task on automatic
detection of verbal multiword expressions (Vincze et al. 2016), VERB PARTICLE CONSTRUCTIONS
(also called phrasal verbs), “are pervasive in English, German, Hungarian and possible other
languages but irrelevant to or very rare in Romance and Slavic languages or in Farsi and Greek
for instance”. As Vincze et al. (2016) also point out, contrary to inherently prepositional verbs
(referred to in this paper as verbs with a governed prepositional phrase), the particle present in
phrasal verbs cannot introduce a complement.

⁶In this article, we use italics to refer to the Spanish grammar written by the Real Academia de
la Lengua Española (RAE, Royal Spanish Language Academy) used as a reference in our work.


Morphosyntactic classes
  Nominal expressions
    Noun compounds: sacacorchos ‘bottle opener’; ruleta rusa ‘Russian roulette’
    Proper names: Nueva York ‘New York’; Unión Europea ‘European Union’;
      Barack Obama ‘Barack Obama’
    Multiword terms: cuenta de resultados ‘profit and loss account’;
      infarto de miocardio ‘myocardial infarction’
  Verbal expressions
    Phrasal verbs: (no Spanish example)
    Light verb constructions: tener fe ‘to have faith’; hacer una foto ‘to take
      a picture’; dar un paseo ‘to go for a walk’
  Adverbial and adjectival expressions: más o menos ‘more or less’;
    en líneas generales ‘by and large’
Difficulty classes
  Fixed expressions: ad hoc ‘ad hoc’; en lo que respecta a ‘with regard to’
  Idiomatic expressions: estirar la pata ‘to kick the bucket’; poner la antena
    ‘to listen without being invited to’; ponerse las pilas ‘to get one's act
    together’; cargar las pilas ‘to recharge one's batteries’
  “True” collocations: escribir una carta ‘to write a letter’; firmar un acuerdo
    ‘to sign an agreement’

Figure 6: Spanish MWEs classified following Ramisch's (2012; 2015) taxonomy.

After revising our data and discussing the different categories we had
encountered, we first decided to eliminate the types compound nouns and multiword
terms and add a new category, complex nominals, to account for single-token
compound nouns in Spanish such as abrebotellas ‘bottle opener’, and syntagmatic
compounds such as botella de vino ‘wine bottle’.

The concept of complex nominals was already introduced by Atkins et al. 
(2001) to account for complex nominal constructions in languages other than 
English that can be considered MWEs. While compounds in Germanic languages 
such as English or German are created by appending several nouns together in 
either several tokens (e.g. English) or one (e.g. German, Norwegian), in Span- 
ish (and other Romance languages such as Italian or French), these expressions 
require the usage of prepositions and articles and show a different structure. 

Multiword terms were eliminated as an MWE type in our taxonomy because
the different types of terms could actually be classified within other MWE types
in our taxonomy. Terms might be either single words (e.g. fideicomiso ‘trust’) or
more complex structures, ranging from complex nominals (e.g. cuenta de
resultados ‘profit and loss account’) to verbal MWEs (e.g. fallar a favor ‘to rule in favor’)
and idiomatic MWEs (e.g. a tenor de lo dispuesto en ‘in accordance with/under
the stipulations of’), which justified their reclassification into other categories in
our new taxonomy. Moreover, terminology is a different research field with its
own taxonomies for classifying terms. The terms gathered in our data were thus
redistributed in the other MWE types in our taxonomy.

Adjectival and adverbial MWEs had to be split into two different categories as
they do not share the same features. Moreover, a closer look at adjectival
expressions revealed that in Spanish we can distinguish between three different main
subclasses: compounds, adjectival phrases and adjectives with a governed
prepositional phrase.

In the case of verbal expressions, we deleted phrasal verbs because, as explained
earlier (cf. §4), Spanish does not have this type of verb. In order to cover other
MWE types in Spanish, we had to add three new subclasses: periphrastic
constructions, verbal phrases and verbs with a prepositional phrase.

We also decided to eliminate the fixed expressions from the taxonomy as this
refers to a type of flexibility rather than a type of MWE. According to Ramisch
(2015), “they correspond to the fixed expressions of Sag et al. (2002), that is, it is
possible to deal with them using the words-with-spaces approach. Such
expressions often play the role of functional words (in short; with respect to), contain
foreign words (ad infinitum; déjà vu) or breach standard grammatical rules (by and
large; kingdom come)”. The fixed expressions present in our data set could easily
be redistributed across two additional MWE types added to the morphosyntactic
types: conjunctional phrases and prepositional phrases. Foreign MWEs have been
excluded from our study because their classification and characterization is beyond
the scope of this article.

As for the other two “difficulty classes” in the taxonomy proposed by Ramisch
(2012; 2015), we also eliminated them as they did not comply with our aim
of classifying MWEs by morphosyntactic types and rather constituted categories
based on semantic criteria (idioms), or statistical co-occurrence (“true”
collocations). We reclassified all items in those categories across several of the
morphosyntactic types: complex nominals, light verb constructions and verbal phrases.
To accommodate the remaining few items that could not be reclassified, we
created a new and broader category: sentential expressions.

Our taxonomy comprises two different axes: MWE morphosyntactic type and
flexibility degree. The MWE morphosyntactic type axis is based on Ramisch's (2012;
2015) taxonomy with the modifications explained above. The flexibility degree
axis is based on the three levels of MWE flexibility identified by Sag et al. (2002)
and Baldwin & Kim (2010). Thus, all MWEs in our data set are classified according
to their morphosyntactic type and flexibility.

Figure 7 shows our taxonomy and its two main axes: the MWE type and the 
flexibility degree. It also quantifies the number of samples in our data set per 
morphosyntactic type and flexibility. 
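
To make the two-axis design concrete, the following minimal sketch shows one way an entry of such a data set could be represented and counted per cell of the taxonomy. The class names, field names and the four sample entries are illustrative assumptions; the type labels follow the categories named in the text, and the sample classifications follow what is reported for those expressions in §6.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Flexibility(Enum):
    FIXED = "fixed"
    SEMI_FIXED = "semi-fixed"
    FLEXIBLE = "flexible"


class MweType(Enum):
    ADJECTIVAL_COMPOUND = "adjectival compound"
    ADJECTIVAL_PHRASE = "adjectival phrase"
    ADJECTIVE_WITH_GOVERNED_PP = "adjective with a governed prepositional phrase"
    ADVERBIAL_EXPRESSION = "adverbial expression"
    CONJUNCTIONAL_PHRASE = "conjunctional phrase"
    COMPLEX_NOMINAL = "complex nominal"
    PROPER_NAME = "proper name"
    NOUN_WITH_GOVERNED_PP = "noun with a governed prepositional phrase"
    PREPOSITIONAL_PHRASE = "prepositional phrase"
    LIGHT_VERB_CONSTRUCTION = "light verb construction"
    PERIPHRASTIC_CONSTRUCTION = "periphrastic construction"
    VERBAL_PHRASE = "verbal phrase"
    VERB_WITH_PP = "verb with a prepositional phrase"
    SENTENTIAL_EXPRESSION = "sentential expression"


@dataclass(frozen=True)
class MweEntry:
    expression: str
    mwe_type: MweType          # first axis: morphosyntactic type
    flexibility: Flexibility   # second axis: flexibility degree


def count_by_axes(entries):
    """Count entries per (type, flexibility) cell of the taxonomy."""
    return Counter((e.mwe_type, e.flexibility) for e in entries)


# Hypothetical mini data set for illustration only.
entries = [
    MweEntry("bocacalle", MweType.COMPLEX_NOMINAL, Flexibility.SEMI_FIXED),
    MweEntry("gripe aviar", MweType.COMPLEX_NOMINAL, Flexibility.FIXED),
    MweEntry("en detrimento de", MweType.PREPOSITIONAL_PHRASE, Flexibility.FIXED),
    MweEntry("dar un paseo", MweType.LIGHT_VERB_CONSTRUCTION, Flexibility.FLEXIBLE),
]
counts = count_by_axes(entries)
```

Keeping the two axes as independent fields means that a count per type, per flexibility degree, or per combination can be derived from the same entries.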


6 The linguistic properties of Spanish MWEs 


In what follows we analyze the Spanish MWEs in our data set per type and
describe their main linguistic properties. The analysis was carried out manually and
complemented by making searches in Spanish written corpora when we needed
to verify our linguistic intuition of a particular MWE.⁷ Specifically, we used two
contemporary Spanish corpora: CREA⁸ and CORPES XXI.⁹

All entries in our data set were manually analyzed. Our manual study, com- 
bined with the grammar study and the corpus queries, allowed us to identify and 
verify the specific linguistic features of Spanish MWEs described here. 


6.1 Adjectival expressions 
6.1.1 Adjectival compounds 


Adjectival compounds in Spanish are one typographic word (e.g. drogadicto ‘drug
addicted’; pelirrojo ‘redheaded’). They are usually formed by joining two
adjectives together, or a noun and an adjective. Although they constitute one
typographic word, we consider them multiwords because they are composed of
several words and might need to be processed in a special way in some NLP
applications (like Machine Translation), as German compounds, for instance.


⁷A deeper corpus study of the MWEs gathered in our data is planned as future work.

⁸Corpus de referencia del español actual (Reference Corpus for Current Spanish):
http://corpus.rae.es/creanet.html.

⁹Corpus del español del siglo XXI (Corpus for 21st Century Spanish):
http://web.frl.es/CORPES/view/inicioExterno.view.

¹⁰As mentioned earlier, the inflectional morphology of Spanish is richer than the morphology
of English and therefore it requires a more detailed linguistic analysis. A similar observation
was made in Savary (2008) and Gralinski et al. (2010), who studied the complexity of encoding
MWEs in morphologically rich languages such as Polish and French. Testing the formalisms
they propose is beyond the scope of this article.


Figure 7: New MWE taxonomy for Spanish.

In our data set, all adjectival compounds are semi-flexible.¹¹ They inflect either
in gender (masculine/feminine) and number (singular/plural), or only in number
(singular/plural).¹² In some cases, these adjectival compounds are nominalized
in usage, despite them being adjectives. For instance, drogadicto can occur in
a sentence as an adjective or a nominalized adjective. Examples (20) and (21)
illustrate this.


(20) Ella está ayudando a un hombre drogadicto.
She is helping to a man.N drug.addicted.ADJ


‘She is helping a drug addicted man.’


(21) Ella está ayudando a un drogadicto.
She is helping to a drug.addicted.N


‘She is helping a drug addict.’


6.1.2 Adjectival phrases 


According to the Spanish Grammar (2010: 261), adjectival phrases are lexicalized
phrases that behave syntactically like adjectives. Many have the structure of a
prepositional phrase which complements a head noun, and sometimes are
equivalent to adverbial collocations complementing predicates (e.g. juramento en falso
‘a lie under oath’ vs. jurar en falso ‘to lie under oath’). Alternatively, they can
also be of the form como ‘as’ followed by a nominal phrase (e.g. como una catedral
‘huge’). Finally, it is also possible to find adjectival phrases formed by adjectives
in coordination (e.g. corriente y moliente ‘plain ordinary’).

The majority of the adjectival phrases gathered in our data set are fixed (14),
although we also registered 2 semi-fixed phrases and 2 flexible ones. The 2 flexible
phrases are of the type “preposition + noun”, whereas, of the semi-fixed ones, one
has the Part-of-Speech (PoS) pattern “preposition + adjective + noun” and the
other one is of the type “adjective + conjunction + adjective”. Moreover, all these
PoS patterns are also present among the 14 fixed ones, which suggests that there
is not a preferred form that favors flexibility.¹³ This seems to be in line with the
fact that these phrases are lexicalized, and thus show a tendency to be invariable.


¹¹Cf. Figure 7.

¹²See Appendix B.

¹³This shall however be confirmed by undergoing a corpus based analysis of all items in our data
set and new ones.


6.1.3 Adjectives with a governed prepositional phrase


Adjectives with a governed prepositional phrase are adjectives that are always
followed by a certain preposition. The preposition is not predictable, since it is
due to both semantic and historical reasons. Moreover, in some cases the
prepositional phrase has to be explicit (e.g. carente de ‘deprived of’), whereas in other
cases where the information is considered to be implicit, the prepositional phrase
can be omitted (e.g. ser fiel a ‘to be loyal to’).

We gathered 13 adjectives with a governed prepositional phrase in our data set. 
All of them are fully flexible, as they can be modified not only according to num- 
ber (singular/plural) and gender (masculine/feminine), but also allow for other 
elements such as adverbs to be inserted between the adjective and the preposi- 
tional phrase. 


6.2 Adverbial expressions 


According to the Spanish Grammar (2010: 599), adverbial expressions are fixed
expressions formed by several words that account for a single adverb. They might
not have the form of an adverb, but they function as such. Some can be
substituted by adverbs ending in -mente (e.g. en secreto ‘in secret’ and secretamente
‘secretly’), but most of them have a more specific or slightly different meaning
from the adverbs which are morphologically similar to the adverbial expression.

There are some very exceptional cases in Spanish in which adverbial
expressions can be slightly modified (Real Academia Española 2010: 600) by adding a
suffix to the main noun (e.g. a golpes/a golpetazos¹⁴ ‘violently’; lit. ‘by hits/by
thumps’) or introducing an adjective between two elements of the expression
(e.g. a mi entender/a mi modesto entender ‘by my understanding/by my modest
understanding’).

There are three different types of adverbial expressions in Spanish: 


• “Preposition + noun phrase”, where the noun phrase may be a single noun
(e.g. por descontado ‘of course’), or a noun modified by other elements such
as determiners or adjectives (e.g. a la fuerza ‘by force’);


• “preposition + adjective/participle” (e.g. a escondidas ‘behind somebody's
back’; por supuesto ‘of course’); and


• “lexicalized phrase” which typically expresses quantity, manner and/or
degree (e.g. una barbaridad ‘quite a lot’; codo con codo ‘elbow to elbow’).


¹⁴In Spanish, the suffix -azo is a very productive suffix with different meanings. Here, it is used
as an augmentative to indicate the size or strength of the blow.


We gathered a total of 51 adverbial expressions in our data set. 28 of them
are of the type “preposition + noun phrase” (12 in which the noun phrase is a
single noun and 16 in which the noun phrase includes modifiers); 11 are of the
type “preposition + adjective/participle”, and the remaining 12 are lexicalized
phrases expressing quantity, manner or degree. A manual analysis of these 51
items revealed that adverbial expressions in Spanish are mostly fixed in their
structure, which confirms what is stated in the Spanish Grammar (2010: 601).


6.3 Conjunctional phrases 


Conjunctional phrases are groups of words containing a conjunction that
function as a single conjunction (e.g. a fin de que ‘in order to’). In Spanish, once
identified, this type of MWE is easy to deal with from an NLP perspective. They
are invariable and do not allow the inflection of any of their parts, which
allows them to be processed successfully using the words-with-spaces approach used
with other fixed expressions. 10 conjunctional phrases were included in our data
set.
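
Because these expressions are invariable, the words-with-spaces approach amounts to matching a token sequence literally and treating it as one unit. The sketch below is one possible illustration, not the procedure used in this study: the function name, the toy lexicon and the greedy longest-match strategy are all assumptions.

```python
def merge_fixed_mwes(tokens, lexicon):
    """Merge every token sequence found in `lexicon` into a single token.

    `lexicon` is a set of tuples of lowercased tokens. Matching is greedy
    and prefers the longest entry starting at the current position.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No fixed MWE starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Toy lexicon with two fixed expressions mentioned in this chapter.
FIXED_LEXICON = {("a", "fin", "de", "que"), ("en", "detrimento", "de")}

result = merge_fixed_mwes("Lo hizo a fin de que vinieras".split(), FIXED_LEXICON)
```

Here `result` contains `a fin de que` as a single token. The same literal matching would fail for semi-fixed and flexible MWEs, which is why those classes need the additional processing discussed in the following sections.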


6.4 Nominal expressions 
6.4.1 Complex nominals 


We have defined this category similarly to what Atkins et al. (2001) propose. Thus, 
it accounts for noun compounds in Spanish, and includes other nominal phrases 
that usually behave as nominal compounds in other languages such as English. 
The Spanish Grammar (2010) accounts for several types of compounds in Spanish: 


• Noun compounds of one typographic word: cascanueces ‘nutcracker’;
limpiacristales ‘window cleaner’; aguafiestas ‘spoilsport’.


• Noun compounds of two typographic words: two nouns after one
another as in mesa camilla ‘round table’; hombre lobo ‘werewolf’; or a noun
followed by an adjective as in guerra civil ‘civil war’.


• Syntagmatic compounds: nominal phrases typically including a
prepositional phrase as in goma de borrar ‘eraser’; café con leche ‘coffee with milk’;
el día a día ‘everyday life’; ley de la jungla ‘law of the jungle’.

We gathered a total of 66 complex nominals in our data set. A manual analysis
of these 66 items revealed that complex nominals in Spanish are either fixed in
their structure (23), or semi-fixed (43).

We further classified our data according to the three types described above. 
11 items were noun compounds of one typographic word, 19 items were noun 
compounds of two typographic words, and the rest (36) were syntagmatic com- 
pounds. All compounds of one typographic word in our data but one are fixed 
and do not experience any kind of morphosyntactic variation in their usage. How- 
ever, this does not hold true for all Spanish noun compounds of one typographic 
word. In our data, most of the noun compounds we gathered end in -s, which 
means that both the singular and the plural forms of such noun compounds are 
the same. Other noun compounds, such as the only one we gathered as semi-fixed 
(bocacalle ‘side-street’) do inflect in plural (bocacalles). 

19 items were noun compounds of two typographic words. In 2 cases these
noun compounds are fixed and do not show any kind of variation: vergüenza
ajena ‘the feeling of being embarrassed for somebody’, and gripe aviar ‘avian
influenza’. The remaining items can be inflected in either singular or plural and
thus are semi-fixed. We gathered 13 items of the type “noun + adjective” and 6
of the type “noun + noun”. While the compounds of the type “noun + adjective”
seem to require that both the noun and the adjective are inflected and agree in
number, in the case of the “noun + noun” compounds this does not always hold
true. In some cases, only the head of the compound can be inflected in the plural
forms (e.g. ciudad dormitorio ‘dormitory town’ vs. ciudades dormitorio ‘dormitory
towns’; and niño prodigio ‘child prodigy’ vs. niños prodigio ‘child prodigies’). The
Spanish Grammar (2010) points out that when the modifier of the compound
adopts an adjectival function (e.g. disco pirata ‘pirated CD’; momento clave ‘key
moment’), the plural form of the compound can be formed by inflecting only the
head of the compound¹⁵ (e.g. discos pirata ‘pirated CDs’; momentos clave ‘key
moments’) or both nouns, the head and the modifier (e.g. discos piratas; momentos
claves).

Finally, the remaining 36 items in our data set were syntagmatic compounds. 
11 of them are fixed, while the other 25 are semi-fixed. 

Complex nominals in Spanish can only inflect in terms of number. Although
there seems to be a pattern in which only the head of the compound is inflected
(e.g. ciudad/ciudades dormitorio ‘dormitory town/towns’), this is not always the
case.


¹⁵In Spanish, the head of a compound is the left-most element in the compound.

For NLP purposes, an easy strategy to test whether a complex nominal is fixed
or allows for inflection would be to inflect the complex nominal in number and
check whether that form can be found in a monolingual corpus. If it cannot
be found, the complex nominal is fixed. Otherwise, it is semi-fixed.
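
The corpus test just described can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the corpus is a plain string standing in for a real monolingual corpus, and the candidate plural forms are supplied by hand because generating them automatically is a separate problem (the head, the modifier, or both may inflect). Note also that absence from a small corpus is weak evidence, so a "fixed" verdict should be read as "no inflected form attested".

```python
import re


def is_attested(form, corpus_text):
    """Return True if `form` occurs in the corpus as a whole-word sequence."""
    pattern = r"\b" + re.escape(form) + r"\b"
    return re.search(pattern, corpus_text, flags=re.IGNORECASE) is not None


def classify_complex_nominal(inflected_candidates, corpus_text):
    """Label a complex nominal by whether any number-inflected form is attested."""
    if any(is_attested(form, corpus_text) for form in inflected_candidates):
        return "semi-fixed"
    return "fixed"


# Toy corpus for illustration only.
corpus = "Las ciudades dormitorio crecieron mucho. La gripe aviar preocupa."

label_ciudad = classify_complex_nominal(
    ["ciudades dormitorio", "ciudades dormitorios"], corpus
)
label_gripe = classify_complex_nominal(["gripes aviares"], corpus)
```

With this toy corpus, ciudad dormitorio is labelled semi-fixed because ciudades dormitorio is attested, while gripe aviar is labelled fixed because no plural form is found.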


6.4.2 Proper names 


Proper names identify a being among others without providing information of its 
features or its constituent parts. These nouns do not express what things are, but 
what their name is as individual entities. Proper names have referring capacity, 
do not participate in lexical relations and, strictly speaking, cannot be translated 
(Spanish Grammar 2010: 209–210).

The Spanish Grammar (2010: 219) identifies two types of proper names: an- 
throponyms and toponyms. However, it also argues that names that account for 
festivals or celebrations, celestial bodies, allegorical representations, works of art, 
foundations, religious orders, companies, clubs, corporations and other institu- 
tions share the same characteristics. 

We gathered a total of 35 proper names in our data set. A manual analysis of 
these 35 items revealed that proper names in Spanish cannot be morphologically 
modified. 

We classified our data into three groups based on the types listed above. 12 items
were toponyms, 11 items were anthroponyms, and 12 were classified under “others”,
which includes celestial bodies, works of art, foundations, companies, clubs,
corporations, etc. None of these items shows any kind of morphological variation.


6.4.3 Nouns with a governed prepositional phrase 


Nouns with a governed prepositional phrase are nouns that are always followed
by a certain preposition. Occasionally, more than one preposition is possible (e.g.
actitud con/hacia/respecto de ‘attitude with/towards/regarding’). This is usually
the case when the phrase following the preposition indicates matter, direction or
addressee. In some cases, two prepositions with exactly the same meaning are
valid (e.g. asalto a/de ‘assault to/on’; solución a/de ‘solution to/of’).

Some nouns followed by a prepositional phrase derive from the verbal form,
maintaining the same preposition (e.g. oler a/olor a ‘to smell like’/‘smell of’;
eximir de/exento de ‘to exempt from’/‘exempt from’). There are cases, though, where
the preposition changes (e.g. amenazar con/amenaza de ‘to threaten to’/‘threat
of’; interesarse por/interesado en ‘to be interested in’/‘interested in’).

We gathered 12 nouns with a governed prepositional phrase. Like the adjectives
with a governed prepositional phrase, all of them are fully flexible. They can be
modified according to number (singular/plural) and gender (masculine/feminine),
and they admit an adverb and/or an adjective between the noun and the
preposition.


6.5 Prepositional phrases 


Prepositional phrases are groups of words containing a preposition that function
as a single preposition (e.g. en detrimento de ‘at the expense of’). Similarly to
conjunctional phrases (cf. §6.3), these MWEs are fixed in Spanish and thus none
of their parts can inflect. Our data set includes 10 prepositional phrases.


6.6 Verbal expressions 
6.6.1 Light verb constructions 


Light verb constructions (LVC) in Spanish are semi-lexicalized verb constructions
formed by a verb that has a supporting role or is semantically weak and that is
complemented by an abstract noun¹⁶ (Real Academia Española 2010: 14). The Spanish Grammar
(Real Academia Española 2010: 14) identifies the following light verbs in Spanish:
dar ‘to give’; tener ‘to have’; tomar ‘to take’; hacer ‘to do’ or ‘to make’; and echar
‘to throw’. In some cases, the noun is preceded by an article. Many LVCs can be
paraphrased using another single verb with similar meaning (e.g. dar un paseo:
pasear ‘to take a walk’: ‘to walk’; hacer alusión: aludir ‘to make an allusion’: ‘to
allude’).

This definition thus differs from the one offered by Laporte (2018 [this
volume]), as well as from the one specified in the annotation guidelines for the
PARSEME shared task on automatic detection of verbal multiword expressions
(Vincze et al. 2016). Vincze et al. (2016) identify the following six general
characteristics of LVCs:


1. They are formed by a verb and its argument containing a noun. The
argument is usually a direct object, but sometimes also a prepositional
complement or a subject.


2. Both the verb and the noun (included in the complement) are lexicalized.


3. The verb is “light”, i.e. it contributes to the meaning of the whole only to a
small degree.


4. The noun has one of its regular meanings.


5. The noun is predicative, and in LVCs one of its arguments becomes also a
syntactic argument of the verb. Moreover, the subject is usually an
argument of the noun.


6. The noun typically refers to an action or event.


¹⁶The Spanish Grammar (2010: 210) defines abstract nouns as those nouns which refer to
something of a non-material nature such as actions, processes and attributes that we assign to beings
when we think of them as independent entities (e.g. beauty, dirt).


Bearing in mind that our ultimate goal is to find a taxonomy of Spanish MWEs 
that can be used from an NLP point of view, we took here a rather comprehensive 
approach and combined both definitions. Thus, the LVCs in our data set include 
both expressions including the light verbs identified by The Spanish Grammar, 
and other verbs that in combination with certain nouns can be considered light 
because their meaning is bleached to a certain extent. 

We gathered a total of 42 LVCs in our data set. The verbs contained in light
verb expressions always inflect in person (1st, 2nd, 3rd; singular or plural), tense
(present, past or future) and mood (indicative, subjunctive or imperative), just as
any other verb. Most of the time, the other elements of the expression (article
and noun) can also be modified without changing the meaning of the expression
(e.g. dar un beso ‘to give a kiss’; dar dos besos ‘to give two kisses’).¹⁷ In our data
set, the noun phrases of 10 of the 42 LVCs can appear either in singular or plural.
There are some exceptional cases in which the meaning of the expression changes
depending on whether the noun is singular or plural (e.g. tener gana ‘to be hungry’
vs. tener ganas ‘to feel like’; hacer ilusión ‘to look forward to’ vs. hacerse ilusiones
‘to get one's hopes up’).¹⁸ Finally, adjectives and adverbs can be included between
the different elements of the expression (e.g. echar profundamente la siesta ‘to take
a nap deeply’; echar una larga siesta ‘to take a long nap’), which means that they are
flexible MWEs.

Regarding other flexibility tests such as pronominalization, topicalization,
subordinate clauses and passivization,¹⁹ further research in large Spanish corpora
would be required. It seems that most constructions do allow for the
pronominalization of the noun (cf. example (8)) and the appearance of subordinate clauses
(e.g. El paseo que dimos ayer ‘The walk we took yesterday’), while they do not
seem so prone to allow for topicalization or passivization.


¹⁷For more examples of changes in the determiner, see Examples (6a) to (6c).

¹⁸These cases are registered in our data set as different MWE entries.

¹⁹Cf. §§3.3–3.6.

From an NLP perspective, light verb expressions are challenging in Spanish.
While some issues such as the verb tenses can be targeted specifically, some
other issues require the usage of other processing strategies. Thus, a change in
the determiner or the insertion of adjectives and adverbs between the different
elements of the expression will require the design of specific strategies to
successfully identify and process these MWEs.
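
One simple strategy of this kind is to match the light verb and the noun over lemmas while tolerating a bounded number of intervening tokens. The sketch below illustrates the idea only; it is not a method proposed or evaluated in this study. It assumes that lemmas are provided by an external lemmatizer (the lemma lists below are written by hand), and the function name and the `max_gap` threshold are hypothetical.

```python
def find_lvc(lemmas, verb, noun, max_gap=4):
    """Find (verb_index, noun_index) pairs for a light verb construction.

    A pair is reported when `noun` follows `verb` with at most `max_gap`
    intervening tokens (determiners, adjectives, adverbs, ...).
    Working on lemmas makes the match independent of verb inflection
    and of the number of the noun.
    """
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != verb:
            continue
        # Look ahead inside the allowed window for the first matching noun.
        window_end = min(i + 2 + max_gap, len(lemmas))
        for j in range(i + 1, window_end):
            if lemmas[j] == noun:
                hits.append((i, j))
                break
    return hits


# Hand-written lemma sequences for examples (16) and (17), with finite verbs:
# "Dimos un largo y agradable paseo" and "Echó profundamente la siesta".
paseo_lemmas = ["dar", "uno", "largo", "y", "agradable", "paseo"]
siesta_lemmas = ["echar", "profundamente", "el", "siesta"]

paseo_hits = find_lvc(paseo_lemmas, "dar", "paseo")
siesta_hits = find_lvc(siesta_lemmas, "echar", "siesta")
```

This sketch covers inflection, changes in the determiner and embedded modifiers, but not cases in which the noun precedes the verb, as in the topicalization and subordinate clause examples of §§3.4–3.5. A fixed window may also produce false positives when the verb and the noun co-occur by chance.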


6.6.2 Periphrastic constructions 


Verbal periphrastic constructions in Spanish are syntactic combinations in which
an auxiliary or semi-auxiliary verb is used in combination with a past
participle, an infinitive or a gerund and both verbs constitute a single predicate (Real
Academia Española 2010: 529). The verb used as an auxiliary can also appear in
non-periphrastic constructions having its full meaning. In some cases, these
constructions include the usage of a preposition (e.g. empezar a ... ‘to begin to ...’;
acabar de ... ‘to have just finished to ...’).

The first verb in the periphrastic construction is the one which undergoes inflection,
whereas the second one always appears in the same non-finite form, and
it is the one which varies and constitutes the main verb of the clause. Sometimes,
as example (22) shows, an element such as an adverb can appear between the
first element of the periphrasis and the second one. The subject can also appear
in between the main verb and the auxiliary or semi-auxiliary verb (example (23)).


(22) Tuvo casi que saltar para no caerse.
     Had.3RD.SG.MASC/FEM almost that jump for not fall.himself/herself
     ‘He/she almost had to jump to avoid falling down.’

(23) No podía yo creérmelo, pero ...
     Not could I believe.it, but ...
     ‘I could not believe it, but ...’


We gathered a total of 19 periphrastic constructions in our data set. Because
they vary in inflection and allow other elements to intervene, we have tentatively
classified them as flexible. However, further research is needed to determine
whether certain types could be considered semi-flexible (i.e. those in which the
MWE only undergoes inflection), because these structures do not seem to allow
for pronominalization, topicalization, subordination or passivization.

(24) Prometió comprar el libro.
     Promised.3RD.SG.MASC/FEM buy the book
     ‘He/she promised to buy the book.’

(25) Pudo comprar el libro.
     Could.3RD.SG.MASC/FEM buy the book
     ‘He/she could have bought the book.’


One problem with this type of construction is that it sometimes has the same
structure as a non-periphrastic one. There are cases in which a full verb is followed
by another verb in a non-finite form and is the head of the predicate, while
the non-finite form introduces a subordinate clause which complements the
main verb. In such cases, there is no periphrasis. In other cases, the same structure
(“inflected verb + verb in non-finite form”) acts as a single unit. In such cases,
the inflected verb acts as an auxiliary or semi-auxiliary verb, while the main verb
is the one in non-finite form. Examples (24) and (25) illustrate this. In (24), comprar
el libro ‘buy the book’ would be a subordinate infinitive clause that is the
direct object of the predicate (prometió ‘promised’) of the main clause. In (25),
however, pudo comprar ‘could have bought’ is the predicate of the clause and el
libro ‘the book’ is its direct object. This makes this type of construction particularly
tricky to detect and to process.
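
Since the surface structure alone does not separate (24) from (25), one simple heuristic is a lexical lookup of known auxiliary patterns. The Python sketch below only illustrates that idea: the inventory `PERIPHRASES` and the function `is_candidate_periphrasis` are hypothetical, the inventory contains only patterns mentioned in this section, and lemmatization and tagging of the input are assumed to have been done beforehand.

```python
# Toy inventory of periphrases (assumption, far from exhaustive):
# (auxiliary lemma, linking word or None, form of the second verb).
PERIPHRASES = {
    ("poder", None, "inf"),
    ("tener", "que", "inf"),
    ("empezar", "a", "inf"),
    ("acabar", "de", "inf"),
}


def is_candidate_periphrasis(first_lemma, linker, second_form):
    """Return True if the verb sequence matches a listed periphrasis."""
    return (first_lemma, linker, second_form) in PERIPHRASES


# (25) pudo comprar: 'poder' + infinitive is listed.
print(is_candidate_periphrasis("poder", None, "inf"))
# (24) prometió comprar: 'prometer' is a full verb and is not listed.
print(is_candidate_periphrasis("prometer", None, "inf"))
```

Such a lookup only yields candidates. Because the verb used as an auxiliary can also appear with its full meaning in non-periphrastic constructions, a positive match would still need to be checked in context.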


6.6.3 Verbal phrases 


Verbal phrases are those MWEs whose head is a verb and which cannot be classi- 
fied as any other type of verbal MWEs. All of them share the feature that to a cer- 
tain extent they are idiomatic expressions whose semantics are non-composition- 
al. As we aimed at classifying Spanish MWEs from a morphosyntactic point of 
view, many of the items that we originally had classified as idioms following 
Ramisch's taxonomy (2012; 2015) are classified as verbal phrases in our data set. 

In total, 26 items of our data set were classified as verbal phrases. 11 of them
were classified as semi-fixed MWEs and the remaining 15 as flexible MWEs. In all
the verbal phrases classified as semi-fixed, the verb appearing in the MWE inflects
(e.g. coger el toro por los cuernos ‘to take the bull by the horns’; empezar la casa
por el tejado ‘to put the cart before the horse’).


This type of structure is worth researching within a larger project including large corpus
searches. This is beyond the scope of this article, where we only aim at detecting MWE types
in Spanish that are not covered in the current MWE taxonomies explained in §2.

Finally, we detected cases in which it was also possible for other words to
appear within the MWE to modify its meaning. In these cases, besides the verb
inflection and the noun singular/plural and masculine/feminine alternations, the
MWE could include other modifying elements. For example, entrar al trapo ‘to
respond to provocations’ can be modified by elements referring to its frequency
(e.g. entrar siempre al trapo ‘to always respond to provocations’).

Another special type of flexibility is the one created by the presence of reflexive
pronouns as part of the verb in the MWE, because depending on the overall
structure of the sentence the pronoun may appear in different positions. Examples
(26a) to (26c) below show this phenomenon with the MWE irse de la lengua
‘to let the cat out of the bag’.


(26) a. No   tienes             que   irte               de    la          lengua
        ADV  V.2ND.SG.PRES.IND  PRON  V.INF+PRON.2ND.SG  PREP  DET.FEM.SG  N.FEM.SG
        not  have(.you)         that  go.yourself        of    the         tongue
        ‘Do not let the cat out of the bag.’

     b. No   te           tienes             que   ir     de    la          lengua
        ADV  PRON.2ND.SG  V.2ND.SG.PRES.IND  PRON  V.INF  PREP  DET.FEM.SG  N.FEM.SG
        not  yourself     have(.you)         that  go     of    the         tongue
        ‘Do not let the cat out of the bag.’

     c. Prometió           que   no   se               iría             de    la          lengua
        V.3RD.SG.PAST.IND  PRON  ADV  PRON.3RD.SG      V.3.SG.COND.IND  PREP  DET.FEM.SG  N.FEM.SG
        Promised.MASC/FEM  that  not  himself/herself  would.go         of    the         tongue
        ‘He/she promised not to let the cat out of the bag.’


As MWEs in which a reflexive verb appears also allow for other types of flexibility
such as the appearance of modifiers, we classified them as flexible MWEs.
However, most of these verbal phrases do not seem to undergo other types of
flexibility such as topicalization or passivization, and further research is needed
to confirm their degree of flexibility.
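
The variable position of the reflexive pronoun in (26a) to (26c) can be made concrete with a small matcher that accepts the pronoun either attached to the infinitive or as a separate word before the verb. The Python sketch below is only an illustration: the form list `IR_FORMS`, the function name `find_irse_de_la_lengua` and the `max_clitic_distance` window are assumptions, and the input is split on whitespace.

```python
REFLEXIVE_CLITICS = {"me", "te", "se", "nos", "os"}
# Toy list of forms of 'ir' (assumption, not exhaustive).
IR_FORMS = {"ir", "voy", "vas", "va", "vamos", "vais", "van", "iría", "fue", "fui"}


def find_irse_de_la_lengua(tokens, max_clitic_distance=3):
    """Locate the MWE 'irse de la lengua' in a token list.

    Returns ("enclitic", verb_index) when the pronoun is attached to the
    infinitive, ("proclitic", pronoun_index) when it is a separate word
    shortly before the verb, and None when the MWE is not found.
    """
    toks = [t.lower() for t in tokens]
    for i in range(len(toks) - 3):
        if toks[i + 1:i + 4] != ["de", "la", "lengua"]:
            continue
        verb = toks[i]
        # Enclitic: irme, irte, irse, ...
        if any(verb == "ir" + clitic for clitic in REFLEXIVE_CLITICS):
            return ("enclitic", i)
        # Proclitic: a separate pronoun within the window before the verb.
        if verb in IR_FORMS:
            for j in range(max(0, i - max_clitic_distance), i):
                if toks[j] in REFLEXIVE_CLITICS:
                    return ("proclitic", j)
    return None


print(find_irse_de_la_lengua("No tienes que irte de la lengua".split()))
print(find_irse_de_la_lengua("No te tienes que ir de la lengua".split()))
print(find_irse_de_la_lengua("Prometió que no se iría de la lengua".split()))
```

The window for the proclitic pronoun is needed because, as in (26b), the pronoun can climb in front of the finite verb and is then no longer adjacent to the infinitive.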


6.6.4 Verbs with a governed prepositional phrase 


Verbs with a governed prepositional phrase are verbs that are always followed
by a certain preposition. The preposition is not predictable, since it is due to
both semantic and historical reasons. Usually, only one preposition governs the
phrase, though occasionally more than one is possible, especially in those cases
where the phrase following the preposition indicates matter, direction or addressee
(e.g. hablar de/sobre/acerca de ‘to talk of/about’; viajar a/hacia/hasta ‘to
travel to/towards’).

Spanish reflexive verbs usually have a governed prepositional phrase (e.g. arrepentirse
de ‘to regret’; referirse a ‘to refer to’), and a few show a possible alternation
between the governed prepositional phrase and a direct object (e.g.
quedarse algo/quedarse con algo ‘to keep something’). Finally, some verbs require
a governed prepositional phrase for some of their meanings. In such cases, the
meaning of the verb is determined by the occurrence of a governed prepositional
phrase (e.g. entender algo/entender de algo ‘to understand something’/‘to know
about something’).

We gathered a total of 21 verbs with a governed prepositional phrase. A manual
analysis revealed that the verb can always inflect in terms of person, tense and
mood. As other elements may intervene between the verb and the prepositional
phrase, and the prepositional phrase can sometimes undergo topicalization (see
example (9)), we tentatively classified all of them as flexible.


6.7 Sentential expressions 


Some of the MWEs that we included in our data set constitute full clauses. They all 
share the fact that they are idiomatic expressions as well. However, as we aimed 
at classifying MWEs from a morphosyntactic point of view, we have classified 
them as "sentential expressions". 

In our data set, only 5 MWEs of this type have been gathered. 4 of them are
fixed, whereas 1 is semi-fixed: la gota que colma el vaso ‘straw that breaks the
camel's back’. Their main difference is that while the fixed ones are fully lexicalized
(e.g. cuando el río suena, agua lleva ‘where there is smoke, there is fire’), the
semi-fixed one allows for verb inflection.


They are similar in this sense to the adjectives and nouns with a governed prepositional phrase
described in Sections 6.1.3 and 6.4.3.

If we consider Spanish proverbs as sentential expressions, this class of our
data set could be expanded greatly. However, at this point we do not aim at finding
a way of automatically identifying such exceptional cases and characterizing
them.


7 Conclusion 


In this article, we have analyzed the different types of Spanish MWEs we identified.
The starting point of our research was a data set created on the basis of an
existing taxonomy for MWEs. In the course of our linguistic analysis, we realized that this
taxonomy was not adequate for describing Spanish MWEs and we modified it to
accommodate our findings.

One interesting finding is the fact that in Spanish there seem to be some MWE 
categories that are only fixed (conjunctional phrases, prepositional phrases and 
proper names), or only flexible (light verb constructions, adjectives, nouns and verbs 
with governed prepositional phrases and verbal periphrastic constructions). Only 
adjectival compounds are exclusively semi-flexible. The other MWE types having 
semi-flexible MWEs are either also fixed (complex nominals and sentential expres- 
sions), also flexible (verbal phrases) or both fixed and flexible (adjectival expres- 
sions and adverbial phrases). 

It also seems clear that MWE typologies should be adapted to the language un- 
der research, and classic typologies mainly based on the English language do not 
seem adequate to describe and classify MWEs in other languages. Our research is 
proof of this fact. Moreover, the taxonomy proposed here has also shown ways
of integrating the traditionally considered “difficulty class” of idioms within the
morphosyntactic classes.

We believe that our work is novel in the sense that we have tested an existing
MWE taxonomy to classify Spanish MWEs. In future work we intend to validate
our data set by asking other linguists whether or not they agree with our classification.
We also intend to expand it for the underrepresented categories and carry
out further corpus searches to validate our analyses.

Another possible path to explore would be to evaluate the extent to which the
flexibility tests discussed in §3 are valid and whether specific types of MWEs require
specific tests. It would also be interesting to explore the word-span between
the different parts of MWEs and whether discontinuous MWEs in Spanish share
some features. This would enable their automatic identification and processing
in NLP applications.


The Centro Virtual Cervantes (Instituto Cervantes) has a collection of Spanish proverbs translated
into other languages, with useful information about their variants and synonyms, that
could be used for further research (http://cvc.cervantes.es/lengua/refranero/Default.aspx).

From a multilingual perspective, it would be interesting to further compare
our data set with the translations of its entries into other languages. This is interesting
from a traductological point of view, as it would allow further comparison of
MWEs and their behavior in different languages. Our data set includes the translations
into English of all the items. Many Spanish MWEs translate as English
MWEs. In fields such as translation studies or Machine Translation, a further
study of these correspondences would be highly relevant.

Finally, it would also be interesting to see if language families share a common
MWE taxonomy. We have argued here for the need for a language-specific MWE taxonomy.
However, it could be that languages belonging to the same language
family share a taxonomy and thus, instead of language-specific taxonomies, there
is a need for language-family-specific taxonomies.

Abbreviations

1/2/3  first/second/third person   N      noun
ADJ    adjective                   NLP    natural language processing
ADV    adverb                      PAST   past tense
CONJ   conjunction                 PL     plural
DET    determiner                  POS    part of speech
FEM    feminine                    PREP   preposition
IND    indicative                  PRES   present tense
INF    infinitive                  PRON   pronoun
GER    gerund                      SG     singular
LVC    light verb construction     SUBJ   subjunctive
MASC   masculine                   V      verb
MWE    multiword expression




Carla Parra Escartín, Almudena Nevado Llopis & Eoghan Sánchez Martínez 


Appendix 

List of abbreviations used in the appendix 
1/2/3 PERS  1st/2nd/3rd person   PAST   past tense
ADJ         adjective            PL     plural
ADV         adverb               POS    possessive
CONJ        conjunction          PP     past participle
DET         determiner           PRES   present tense
FEM         feminine             PREP   preposition
GER         gerund               REFLV  reflexive verb
IND         indicative           PRON   pronoun
INF         infinitive           SG     singular
MASC        masculine            SUBJ   subjunctive
N           noun                 V      verb


The following three appendices present the Spanish data set used in this article,
classified according to our taxonomy. It should be noted that the translations of
MWEs do not always result in MWEs in the target language, nor in the same syntactic
class.


Appendix A Spanish Fixed MWEs data set 


Table 1: Adjectival phrases.

    Spanish MWE           PoS pattern in Spanish  English translation
1   a cuadros             prep + n                plaid
2   a rayas               prep + n                striped
3   como puños            adv + n                 like daggers
4   como una catedral     adv + det + n           huge
5   contante y sonante    adj + conj + adj        hard cash
6   corriente y moliente  adj + conj + adj        plain ordinary
7   de gala               prep + n                gala
8   de pared              prep + n                wall
9   de segunda mano       prep + adj + n          second hand
10  en directo            prep + n                live
11  en falso              prep + adj              lie
12  en jarras             prep + n                on hips
13  en vivo               prep + adj              live
14  mondo y lirondo       adj + conj + adj        plain and simple

Table 2: Adverbial expressions.

    Spanish MWE               PoS pattern in Spanish                      English translation
1   a bote pronto             prep + n (masc; sg) + adj (masc; sg)        out of the blue
2   a caballo                 prep + n (masc; sg)                         on horseback
3   a escondidas              prep + pp (fem; pl)                         behind somebody's back
4   a fondo                   prep + n (masc; sg)                         in depth
5   a grito pelado            prep + n (masc; sg) + adj (masc; sg)        at the top of one's lungs
6   a gusto                   prep + n (masc; sg)                         at ease
7   a la carrera              prep + det (fem; sg) + n (fem; sg)          in a rush
8   a la fuerza               prep + det (fem; sg) + n (fem; sg)          by force
9   a la perfección           prep + det (fem; sg) + n (fem; sg)          to perfection
10  a la vez                  prep + det (fem; sg) + n (fem; sg)          all at once
11  a la vista                prep + det (fem; sg) + n (fem; sg)          in sight
12  a las mil maravillas      prep + det (fem; pl) + adj + n (fem; pl)    perfectly
13  a manos llenas            prep + n (fem; pl) + adj (fem; pl)          hand over fist
14  a medias                  prep + adj (fem; pl)                        halfway
15  a oscuras                 prep + adj (fem; pl)                        in the dark
16  a secas                   prep + adj (fem; pl)                        plainly
17  a tientas                 prep + n (fem; pl)                          blindly
18  a toda velocidad          prep + adj (fem; sg) + n (fem; sg)          at full speed
19  al por mayor              prep + det (masc; sg) + prep + adj (masc; sg)  wholesale
20  codo con codo             n (masc; sg) + prep + n (masc; sg)          elbow-to-elbow
21  con las manos en la masa  prep + det (fem; pl) + n (fem; pl) + prep + det (fem; sg) + n (fem; sg)  red-handed
22  contra reloj              prep + n (masc; sg)                         against the clock
23  con una mano delante y otra detrás  prep + det (fem; sg) + n (fem; sg) + adv + conj + adj (fem; sg) + adv  from hand to mouth
24  de buenas                 prep + adj (fem; pl)                        with all one's heart
25  de cabo a rabo            prep + n (masc; sg) + prep + n (masc; sg)   head to tail
26  de golpe y porrazo        prep + n (masc; sg) + conj + n (masc; sg)   all of a sudden
27  de reojo                  prep + n (masc; sg)                         out of the corner of one's eye
28  en breve                  prep + adj (masc; sg)                       shortly/in due course
29  en consecuencia           prep + n (fem; sg)                          consequently
30  en definitiva             prep + adj (fem; sg)                        in conclusion
31  en el acto                prep + det (masc; sg) + n (masc; sg)        in the act
32  en líneas generales       prep + n (fem; pl) + adj (fem; pl)          by and large
33  en pocas palabras         prep + adj (fem; pl) + n (fem; pl)          in a nutshell
34  en secreto                prep + n (masc; sg)                         in secret
35  en suma                   prep + n (fem; sg)                          in short
36  en un santiamén           prep + det (masc; sg) + n (masc; sg)        in a flash
37  más o menos               adv + conj + adv                            more or less
38  ni más ni menos           conj + adv + conj + adv                     no more, no less
39  para colmo                prep + n (masc; sg)                         to top it all
40  por casualidad            prep + n (fem; sg)                          by chance
41  por cierto                prep + adj (masc; sg)                       by the way
42  por consiguiente          prep + adj (masc; sg)                       hence
43  por descontado            prep + pp (masc; sg)                        needless to say
44  por el contrario          prep + det (masc; sg) + adj (masc; sg)      on the contrary
45  por supuesto              prep + adj (masc; sg)                       of course
46  sin embargo               prep + n (masc; sg)                         nevertheless
47  sin más ni más            prep + adv + conj + adv                     just like that
48  sin ton ni son            prep + n (masc; sg) + adv + n (masc; sg)    without rhyme or reason
49  una barbaridad            det (fem; sg) + n (fem; sg)                 quite a lot

Table 3: Conjunctional phrases.

    Spanish MWE      PoS pattern in Spanish             English translation
1   a fin de que     prep + n (masc; sg) + prep + conj  in order to
2   a medida que     prep + n (fem; sg) + conj          as
3   a menos que      prep + adv + conj                  unless
4   así que          adv + conj                         consequently
5   con tal de que   prep + adv + prep + conj           as long as
6   mientras que     adv + conj                         while
7   siempre que      adv + conj                         whenever
8   tan pronto como  adv + adv + conj                   as soon as
9   visto que        adj + conj                         since
10  ya que           adv + conj                         because

Table 4: Complex nominals.

    Spanish MWE                    PoS pattern in Spanish                      English translation
1   abrebotellas                   n (masc; sg/pl)                             bottle opener
2   aguafiestas                    n (masc/fem; sg/pl)                         spoilsport
3   cascanueces                    n (masc; sg/pl)                             nutcracker
4   correveidile                   n (fem/masc; sg)                            tell-tale
5   lavavajillas                   n (masc; sg/pl)                             dishwasher
6   limpiacristales                n (fem/masc; sg/pl)                         window cleaner
7   rascacielos                    n (masc; sg/pl)                             skyscraper
8   sacacorchos                    n (masc; sg/pl)                             bottle opener
9   soplagaitas                    n (fem/masc; sg/pl)                         dumbbell
10  pinchadiscos                   n (masc/fem; sg/pl)                         disc jockey
11  complejo de Edipo              n (masc; sg) + prep + n (masc; sg)          Oedipus complex
12  el día a día                   det (masc; sg) + n (masc; sg) + prep + n (masc; sg)  everyday life
13  el día del juicio final        det (masc; sg) + n (masc; sg) + prep + det (masc; sg) + n (masc; sg) + adj (masc; sg)  doomsday
14  gripe aviar                    n (fem; sg) + adj (fem; sg)                 avian influenza
15  la flor y la nata              det (fem; sg) + n (fem; sg) + conj + det (fem; sg) + n (fem; sg)  cream of the crop
16  la gran pantalla               art (fem; sg) + adj (fem; sg) + n (fem; sg)  the big screen
17  la teoría de la relatividad    det (fem; sg) + n (fem; sg) + prep + det (fem; sg) + n (fem; sg)  theory of relativity
18  mucho ruido y pocas nueces     adj (masc; sg) + n (masc; sg) + conj + adj (fem; pl) + n (fem; pl)  much ado about nothing
19  perro ladrador, poco mordedor  n (masc; sg) + adj (masc; sg) + adv + adj (masc; sg)  his bark is worse than his bite
20  sentido del ridículo           n (masc; sg) + prep + n (masc; sg)          self-conscious
21  síndrome de Down               n (masc; sg) + prep + n (masc; sg)          Down Syndrome
22  vergüenza ajena                n (fem; sg) + adj (fem; sg)                 feel embarrassment for
23  síndrome de Estocolmo          n (masc; sg) + prep + n (masc; sg)          Stockholm Syndrome

Table 5: Proper names.

    Spanish MWE              PoS pattern in Spanish                         English translation
1   Air Jordan               n (masc; sg) + n (masc; sg)                    Air Jordan
2   Al Capone                n (masc; sg) + n (masc; sg)                    Al Capone
3   América Latina           n (fem; sg) + adj (fem; sg)                    Latin America
4   Amnistía Internacional   n (fem; sg) + adj (fem; sg)                    Amnesty International
5   Banco Central Europeo    n (masc; sg) + adj (masc; sg) + adj (masc; sg)  European Central Bank
6   Billy el Niño            n (masc; sg) + det (masc; sg) + n (masc; sg)   Billy the Kid
7   Buenos Aires             adj (masc; pl) + n (masc; pl)                  Buenos Aires
8   Costa Rica               n (fem; sg) + adj (fem; sg)                    Costa Rica
9   Cruz Roja                n (fem; sg) + adj (fem; sg)                    Red Cross
10  el Cordobés              det (masc; sg) + adj (masc; sg)                el Cordobés
11  El Greco                 det (masc; sg) + adj (masc; sg)                El Greco
12  El Pelusa                det (masc; sg) + n (fem; sg)                   el Pelusa
13  El Principito            det (masc; sg) + n (masc; sg)                  The Little Prince
14  Gran Bretaña             adj (fem; sg) + n (fem; sg)                    Great Britain
15  Jose María               n (masc; sg) + n (fem; sg)                     Jose Maria
16  La Paz                   det (fem; sg) + n (fem; sg)                    La Paz
17  La sombra del viento     det (fem; sg) + n (fem; sg) + prep + det (masc; sg) + n (masc; sg)  The Shadow of the Wind
18  Lawrence de Arabia       n (masc; sg) + prep + n (fem; sg)              Lawrence of Arabia
19  Lord Byron               n (masc; sg) + n (masc; sg)                    Lord Byron
20  Los Ángeles              det (masc; pl) + n (masc; pl)                  Los Angeles
21  Manchester United        n + adj                                        Manchester United
22  María Jose               n (fem; sg) + n (masc; sg)
23  Médicos Sin Fronteras    n (masc; pl) + prep + n (fem; pl)              Doctors Without Borders
24  Mona Lisa                n (fem; sg) + n (fem; sg)                      Mona Lisa
25  Nueva York               adj (fem; sg) + n (fem; sg)                    New York
26  Nueva Zelanda            adj (fem; sg) + n (fem; sg)                    New Zealand
27  Osa Mayor                n (fem; sg) + adj (fem; sg)                    Ursa Major
28  Países Bajos             n (fem; sg) + adj (fem; sg)                    the Netherlands
29  Papá Noel                n (masc; sg) + n (masc; sg)                    Father Christmas
30  Real Academia Española   adj (fem; sg) + n (fem; sg) + adj (fem; sg)    Royal Spanish Language Academy
31  Real Madrid              adj (masc; sg) + n (masc; sg)                  Real Madrid
32  Reino Unido              n (masc; sg) + adj (masc; sg)                  United Kingdom
33  República Dominicana     n (fem; sg) + adj (fem; sg)                    Dominican Republic
34  San Salvador             adj (fem; sg) + n (masc; sg)                   San Salvador
35  Unión Europea            n (fem; sg) + adj (fem; sg)                    European Union

Table 6: Prepositional phrases.

    Spanish MWE         PoS pattern in Spanish                       English translation
1   por culpa de        prep + n (fem; sg) + prep                    because of
2   a pesar de          prep + n (masc; sg) + prep                   in spite of
3   al margen de        prep + det (masc; sg) + n (masc; sg) + prep  apart from
4   con miras a         prep + n (fem; sg) + prep                    looking to
5   de conformidad con  prep + n (fem; sg) + prep                    according to
6   en contra de        prep + n (fem; sg) + prep                    in opposition to
7   en cuanto a         prep + adverb + prep                         with regard to
8   en detrimento de    prep + n (masc; sg) + prep                   at the expense of
9   en relación con     prep + n (fem; sg) + prep                    in relation to
10  respecto a          n (masc; sg) + prep                          in relation to

Table 7: Sentential expressions.

    Spanish MWE                                PoS pattern in Spanish  English translation
1   cuando el río suena, agua lleva            conj + det (masc; sg) + n (masc; sg) + v (3rd pers; sg) + n (fem; sg) + v (3rd pers; sg)  where there's smoke, there's fire
2   cuando las ranas críen pelo                adv + det (fem; pl) + n (fem; pl) + v (3rd pers; pl) + n (masc; sg)  when pigs fly
3   dime con quién andas y te diré quién eres  v (2nd pers; sg) + prep + pron + v (2nd pers; sg) + conj + pron + v (1st pers; sg) + pron + v (2nd pers; sg)  birds of a feather flock together
4   más vale tarde que nunca                   adv + v (3rd pers; sg) + adv + conj + adv  better late than never

Appendix B Spanish Semi-fixed MWEs data set


Table 8: Adjectival compounds.

    Spanish MWE      PoS pattern in Spanish  English translation
1   agridulce        adj (masc/fem; sg)      sweet-and-sour/bittersweet
2   boquiabierto     adj (masc; sg)          open-mouthed
3   cabizbajo        adj (masc; sg)          downcast
4   cejijunto        adj (masc; sg)          unibrow
5   drogadicto       adj (masc; sg)          drug addict
6   hispanohablante  adj (masc/fem; sg)      Spanish-speaking
7   narcotraficante  adj (masc/fem; sg)      drug dealer/drug trafficker
8   patidifuso       adj (masc; sg)          astonished
9   pelirrojo        adj (masc; sg)          redheaded
10  vasodilatador    adj (masc; sg)          vasodilator

Table 9: Adjectival phrases.

    Spanish MWE      PoS pattern in Spanish     English translation
1   de primera mano  prep + adj + n (fem; sg)   first hand
2   sano y salvo     adj + conj + adj           safe and sound

Table 10: Adverbial expressions.

    Spanish MWE  PoS pattern in Spanish  English translation
1   a golpes     prep + n (masc; pl)     violently

Table 11: Complex nominals.

    Spanish MWE                     PoS pattern in Spanish                     English translation
1   la ley de la jungla             det (fem; sg) + n (fem; sg) + prep + det (fem; sg) + n (fem; sg)  law of the jungle
2   anillo de compromiso            n (masc; sg) + prep + n (masc; sg)         engagement ring
3   bicicleta estática              n (fem; sg) + adj (fem; sg)                exercise bike
4   bocacalle                       n (fem; sg)                                side-street
5   bomba nuclear                   n (fem; sg) + adj (fem; sg)                nuclear bomb
6   café con leche                  n (masc; sg) + prep + n (fem; sg)          coffee with milk
7   campo de concentración          n (masc; sg) + prep + n (fem; sg)          concentration camp
8   centro de salud                 n (masc; sg) + prep + n (fem; sg)          health center
9   cinta de correr                 n (fem; sg) + prep + inf                   treadmill
10  ciudad dormitorio               n (fem; sg) + n (masc; sg)                 dormitory town
11  complejo de inferioridad        n (masc; sg) + prep + n (fem; sg)          inferiority complex
12  crema de manos                  n (fem; sg) + prep + n (fem; pl)           hand cream
13  cuenta de débito                n (fem; sg) + prep + n (masc; sg)          debit account
14  cuenta de resultados            n (fem; sg) + prep + n (masc; pl)          profit and loss account
15  cuento chino                    n (masc; sg) + adj (masc; sg)              a tall tale
16  deporte de aventura             n (masc; sg) + prep + n (fem; sg)          adventure sport
17  diente de león                  n (masc; sg) + prep + n (masc; sg)         dandelion
18  disco pirata                    n (masc; sg) + n (masc; sg)                pirate CD
19  fin de semana                   n (masc; sg) + prep + n (fem; sg)          weekend
20  goma de borrar                  n (fem; sg) + prep + inf                   eraser
21  guerra civil                    n (fem; sg) + adj (fem; sg)                civil war
22  hombre lobo                     n (masc; sg) + n (masc; sg)                werewolf
23  hueso duro de roer              n (masc; sg) + adj (masc; sg) + prep + inf  hard nut to crack
24  impuesto revolucionario         n (masc; sg) + adj (masc; sg)              revolutionary tax
25  infarto de miocardio            n (masc; sg) + prep + n (masc; pl)         myocardial infarction
26  la gallina de los huevos de oro  det (fem; sg) + n (fem; sg) + prep + det (masc; pl) + n (masc; pl) + prep + n (masc; sg)  cash cow
27  la ley del más fuerte           det (fem; sg) + n (fem; sg) + prep + det (masc; sg) + adv + adj (masc; sg)  survival of the fittest
28  lobo con piel de cordero        n (masc; sg) + prep + n (fem; sg) + prep + n (masc; sg)  wolf in sheep's clothing
29  mesa camilla                    n (fem; sg) + n (fem; sg)                  round table
30  momento clave                   n (masc; sg) + n (fem; sg)                 key moment
31  niño mimado                     n (masc; sg) + adj (masc; sg)              blue-eyed boy
32  niño prodigio                   n (masc; sg) + n (masc; sg)                child prodigy
33  patata caliente                 n (fem; sg) + adj (fem; sg)                hot potato
34  perro de caza                   n (masc; sg) + prep + n (fem; sg)          hunting dog
35  raíz cuadrada                   n (fem; sg) + adj (fem; sg)                square root
36  realidad virtual                n (fem; sg) + adj (fem; sg)                virtual reality
37  renta per cápita                n (fem; sg) + prep + n (fem; sg)           income per capita
38  ruleta rusa                     n (fem; sg) + adj (fem; sg)                Russian roulette
39  salto mortal                    n (masc; sg) + adj (masc; sg)              somersault
40  sentimiento de culpa            n (masc; sg) + prep + n (fem; sg)          guilt
41  tarjeta de crédito              n (fem; sg) + prep + n (masc; sg)          credit card
42  tortilla de patata              n (fem; sg) + prep + n (fem; sg)           Spanish omelette
43  zumo de naranja                 n (masc; sg) + prep + n (fem; sg)          orange juice

Table 12: Verbal phrases.

    Spanish MWE                    PoS pattern in Spanish                       English translation
1   coger el toro por los cuernos  v + det (masc; sg) + n (masc; sg) + prep + det (masc; pl) + n (masc; pl)  to take the bull by the horns
2   echar por tierra               v + prep + n (fem; sg)                       to upset the applecart
3   empezar la casa por el tejado  v + det (fem; sg) + n (fem; sg) + prep + det (masc; sg) + n (masc; sg)  to put the cart before the horse
4   estar como unas castañuelas    v + adv + det (fem; pl) + n (fem; pl)        to be tickled pink
5   ir de guatemala a guatepeor    v + prep + n (fem; sg) + prep + n (masc; sg)  out of the frying pan and into the fire
6   ni pinchar ni cortar           conj + v + conj + v                          to cut no ice
7   ser de armas tomar             v + prep + n (fem; pl) + v                   to be someone to be reckoned with
8   ser el ojito derecho           v + det (masc; sg) + n (masc; sg) + adj (masc; sg)  to be the apple of one's eye
9   ser harina de otro costal      v + n (fem; sg) + prep + adj (masc; sg) + n (masc; sg)  to be a horse of a different colour
10  ser la crème de la crème       v + det (fem; sg) + n (fem; sg) + prep + det (fem; sg) + n (fem; sg)  to be crème de la crème
11  vivir a cuerpo de rey          v + prep + n (masc; sg) + prep + n (masc; sg)  to live high on the hog

Table 13: Sentential expressions.

    Spanish MWE                 PoS pattern in Spanish  English translation
1   la gota que colma el vaso   det (fem; sg) + n (fem; sg) + conj + v (3rd pers; sg) + det (masc; sg) + n (masc; sg)  straw that breaks the camel's back

Appendix C Spanish Flexible MWEs data set


Table 14: Adjectival phrases.

    Spanish MWE  PoS pattern in Spanish  English translation
1   de cuidado   prep + n (masc; sg)     dangerous
2   de ensueño   prep + n (masc; sg)     fantastic

Table 15: Adjectives with a governed prepositional phrase.

    Spanish MWE     PoS pattern in Spanish     English translation
1   adicto a        adj (masc; sg) + prep      addicted to
2   aficionado a    adj (masc; sg) + prep      fond of
3   apto para       adj (masc; sg) + prep      suitable for
4   aspirante a     adj (masc/fem; sg) + prep  candidate for
5   carente de      adj (masc/fem; sg) + prep  deprived of
6   casado con      adj (masc; sg) + prep      married to/with
7   celoso de       adj (masc; sg) + prep      jealous of
8   culpable de     adj (masc/fem; sg) + prep  guilty of
9   dependiente de  adj (masc/fem; sg) + prep  dependent on
10  exento de       adj (masc; sg) + prep      exempt from
11  interesado en   adj (masc; sg) + prep      interested in
12  preocupado por  adj (masc; sg) + prep      worried about
13  sospechoso de   adj (masc; sg) + prep      suspected of

Table 16: Adverbial expression.

    Spanish MWE                            PoS pattern in Spanish     English translation
1   a mi/tu/su/nuestro/vuestro entender    prep + pos + n (masc; sg)  by my/your/her/his/our/their understanding

Table 17: Nouns with a governed prepositional phrase.

    Spanish MWE                    PoS pattern in Spanish  English translation
1   actitud con/hacia/respecto de  n (fem; sg) + prep      attitude with/towards/regarding
2   amenaza de                     n (fem; sg) + prep      threat of
3   asalto a/de                    n (masc; sg) + prep     assault to/on
4   confianza en                   n (fem; sg) + prep      trust in
5   esperanza de                   n (fem; sg) + prep      hope to
6   interés por                    n (masc; sg) + prep     interest in
7   olor a                         n (masc; sg) + prep     smell of
8   prohibición de                 n (fem; sg) + prep      prohibition of
9   sabor a                        n (masc; sg) + prep     taste of
10  salida de                      n (fem; sg) + prep      exit of
11  traducción a                   n (fem; sg) + prep      translation to
12  veto a                         n (fem; sg) + prep      ban on

Table 18: Light verb constructions.

    Spanish MWE          PoS pattern in Spanish             English translation
1   cantar las cuarenta  v + det (fem; pl) + adj (fem; pl)  to haul over the coals
2   comer la olla        v + det (fem; sg) + n (fem; sg)    to talk someone into something
3   cortar el bacalao    v + det (masc; sg) + n (masc; sg)  to be the big cheese/big fish
4   dar acidez           v + n (fem; sg)                    to produce heartburn
5   dar ánimos           v + n (masc; pl)                   to cheer up
6   dar calor            v + n (masc; sg)                   to keep warm
7   dar carpetazo        v + n (masc; sg)                   to put an end to
8   dar esquinazo        v + n (masc; sg)                   to give the slip
9   dar la palabra       v + det (fem; sg) + n (fem; sg)    to give the floor to
10  dar la tabarra       v + det (fem; sg) + n (fem; sg)    to pester
11  dar plantón          v + n (masc; sg)                   to stand [sb] up
12  dar suerte           v + n (fem; sg)                    to give [sb] luck
13 dar un beso v + det (masc; sg) + n (masc; sg) to give a kiss
14 dar una patada v + det (fem; sg) + n (fem; sg) to kick
15 dar un paseo v + det (masc; sg) + n (masc; sg) to go for a walk
16 dar un puñetazo v + det (masc; sg) + n (masc; sg) to punch
17 despertar el apetito v + det (masc; sg) + n (masc; sg) to awaken one's appetite
18 echar la siesta v + det (fem; sg) + n (fem; sg) to take a nap
19 echar un cable v + det (masc; sg) + n (masc; sg) to give a hand
20 empinar el codo v + det (masc; sg) + n (masc; sg) to bend one's elbow
21 hacer alusión v + n (fem; sg) to make an allusion
22 hacer añicos v + n (masc; pl) to break into pieces
23 hacer gracia v + n (fem; sg) to be funny
24 hacer ilusión v + n (fem; sg) to look forward to
25 hacer la compra v + det (fem; sg) + n (fem; sg) to do the shopping
26 hacer la pelota v + det (fem; sg) + n (fem; sg) to suck up to
27 hacer un trato v + det (masc; sg) + n (masc; sg) to make a deal
28 hacer una foto v + det (fem; sg) + n (fem; sg) to take a picture
29 hacer una oferta v + det (fem; sg) + n (fem; sg) to make an offer
30 hacerse ilusiones refl v + n (fem; pl) to get one's hopes up
31 levar anclas v + n (fem; pl) to weigh anchor
32 llamar la atención v + det (fem; sg) + n (fem; sg) to attract one's attention
33 pasar la pelota v + det (fem; sg) + n (fem; sg) to pass the buck
34 ponerse las pilas refl v + det (fem; pl) + n (fem; pl) to get one's act together
35 sacar pecho v + n (masc; sg) to stick your chest out
36 sufrir las consecuencias v + det (fem; pl) + n (fem; pl) to suffer the consequences
37 tener gana v + n (fem; sg) to be hungry
38 tener ganas v + n (fem; pl) to feel like
39 tomar el pelo v + det (masc; sg) + n (masc; sg) to tease [someone]
40 tomar el sol v + det (masc; sg) + n (masc; sg) to sunbathe
41 tomar partido v + n (masc; sg) to take sides
42 tomar una decisión v + det (fem; sg) + n (fem; sg) to make a decision
Table 19: Periphrastic constructions. Periphrastic constructions do not
have straightforward English translations. The ones given here indicate
what they usually mean, but the exact translation depends on the verb
that appears in a non-finite form in the periphrasis.


Spanish MWE PoS pattern in Spanish English translation

1 acabar de + inf v + prep + inf to finish to
2 andar + ger v + ger to be doing
3 deber + inf v + inf to have to
4 deber de + inf v + prep + inf to may have
5 empezar a + inf v + prep + inf to begin to
6 estar por + inf v + prep + inf to be about to
7 haber de + inf v + prep + inf to have to
8 haber que + inf v + pron + inf to have to
9 ir + ger v + ger to begin/be doing
10 ir a + inf v + prep + inf to go to
11 llegar a + inf v + prep + inf to manage to
12 llevar + ger v + ger to have been doing
13 llevar + pp v + pp to have done
14 poder + inf v + inf to be able to
15 sacar a + inf v + prep + inf to take someone out to
16 seguir + ger v + ger to continue doing
17 tener que + inf v + pron + inf to have to
18 venir + ger v + ger to have been doing
19 venir a + inf v + prep + inf to be
20 volver a + inf v + prep + inf to do something again
Table 20: Verbal phrases.

Spanish MWE PoS pattern in Spanish English translation

1 dar por sentado v + prep + adj (masc; sg) take for granted
2 entrar al trapo v + prep + det (masc; sg) + n (masc; sg) to respond to provocations
3 estar al pie del cañón v + prep + det (masc; sg) + n (masc; sg) + prep + det (art; sg) + n (masc; sg) to be ready and waiting
4 estar en Babia v + prep + n (fem; sg) to be daydreaming
5 estar en las nubes v + prep + det (fem; pl) + n (fem; pl) to be in the clouds
6 hacer una montaña de v + det (fem; sg) + n (fem; sg) + prep + det (masc; sg) + n (masc; sg) + prep + n (fem; sg) make a mountain out of a molehill
7 irse de la lengua refl v + prep + det (fem; sg) + n (fem; sg) to let the cat out of the bag
8 irse de rositas refl v + prep + n (fem; pl) to get off scot free
9 irse por las ramas refl v + prep + det (fem; pl) + n (fem; pl) to beat around the bush
10 llamar a la puerta equivocada v + prep + det (fem; sg) + n (fem; sg) + adj (fem; sg) to bark up the wrong tree
11 salir al paso v + prep + det (masc; sg) + n (masc; sg) to refute
12 salir de cuentas v + prep + n (fem; pl) to be due
13 salir de marcha v + prep + n (fem; sg) to go partying
14 saltar a la comba v + prep + det (fem; sg) + n (fem; sg) to skip rope
15 ser fiel a v + adj (masc/fem; sg) + prep to be loyal to
Table 21: Verbs with a governed prepositional phrase.

Spanish MWE PoS pattern English translation

1 abstenerse de refl v + prep to refrain yourself from
2 acordarse de refl v + prep to remember
3 amenazar con v + prep to threaten to
4 arrepentirse de refl v + prep to regret
5 atenerse a refl v + prep to stick to
6 confiar en v + prep to trust in
7 contribuir a v + prep to contribute to
8 creer en v + prep to believe in
9 cuidar de v + prep to take care of
10 empeñarse en refl v + prep to insist on
11 engancharse a refl v + prep to get hooked on
12 entender de v + prep to know about
13 eximir de v + prep to exempt from
14 gozar de v + prep to enjoy
15 hablar de/sobre/acerca de v + prep to talk about/of
16 interesarse por refl v + prep to be interested in
17 oler a v + prep to smell like
18 pelear por v + prep to fight for
19 quedarse con refl v + prep to keep
20 referirse a refl v + prep to refer to
21 viajar a/hacia/hasta v + prep to travel to/towards
idiomaticity, 125
multiword term, 287
non-decomposable, 2-5, 15-22
phrasal verb, 276, 286
proverb, 277
semi-fixed, 73, 299, 311
semi-flexible, 149, 281, 282, 291
syntactically irregular, 4, 132, 288
verbal, 4, 64, 67 
word with spaces, 209, 288 
‘winged words’, 124 
MWE, see multiword expression 


name, ix, xiii, 32-34, 274, 295, 309 
location name, 32 
location names, 47 
multiword name, 33, 35-37, 275 
organization name, 32 
personal name, 32 
named entity, see name 
Natural Language Processing, 64, 154, 155, 165, 170, 272, 274-277
NLP, see Natural Language Processing


object 
direct object, 188-193 
indirect object, 188-193 


passive, vii, xvii, 4, 5, 8-9, 11-13, 17", 
146°, 17-203, 284 

passivization, see passive 

pattern, 37, 41, 95, 97 




Pattern Dictionary of English Verbs, 97

PDEV, see Pattern Dictionary of English Verbs

periphrastic construction, 298-299 

phonotactics, 122 

phrasal verb, 144, 288 

phraseological unit, see multiword expression

PMI, see Pointwise Mutual Information

Point-wise Mutual Information, 104 

polylexicality, v 

proper name, see name 

proper noun, see name 


reproducible observation, 159-170, 175, 177
rigidity, 105 


SBCG, see Sign-Based Construction Grammar
semantic evaluation, 161
semantics
compositionality, 12-15, 71, 13655, 1508, 15516, 176
differential semantic evaluation, 161
Lexical Resource Semantics, 2, 13-15, 18
Minimal Recursion Semantics, 13
non-compositionality, 76, 84, 125, 131
underspecification, 13-15
Sign-Based Construction Grammar, xix, 13-14
simile, xii 
standard deviation, 108 


support verb, see light verb

support verb construction, see light verb construction

SVC, see light verb construction

syntactic construction, see syntactic operation

syntactic operation, 146, 152-159, 165, 173-175, 178

syntactic reorganization, 228-229, 237-238

syntactic variability, 216 


taxonomy, xx-xxii, 143-144, 146°, 149, 150, 155, 158, 172, 175-180, 248-250, 274-278, 286
thematic role 
experiencer, 79-80 
ObjectExperiencer, 79 
SubjectExperiencer, 79 
translation equivalent, 263 
Tree-Adjoining Grammar, 155 
trigger, 33 
truncation, 122, 126, 134 


Word Sense Disambiguation, 94
WSD, see Word Sense Disambiguation
Multiword expressions


Multiword expressions (MWEs) are a challenge for both natural language applications
and linguistic theory because they often defy the application of the machinery
developed for free combinations, where the default is that the meaning of an utterance
can be predicted from its structure.

There is a rich body of primarily descriptive work on MWEs for many European
languages, but little comparative work. This volume brings together MWE experts to
explore the benefits of a multilingual perspective on MWEs. The ten contributions in
this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek,
Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such
as the classification of MWEs, their formal grammatical modeling, and the description of
individual MWE types from the point of view of different theoretical frameworks, such
as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar,
Lexical Functional Grammar, and Lexicon Grammar.


